Take Home Exercise 03

Author

Henry Low

Published

02 11 2024

Modified

08 11 2024

3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods

Background

Housing in Singapore has always been a hot topic, whether for investment or for accommodation. Many factors affect the price, and they are mainly divided into structural and locational factors. Structural factors are variables related to the property itself, such as its size, fittings, and tenure. Locational factors are variables related to the property's neighbourhood, such as proximity to childcare centres, public transport services, and shopping centres.

However, we also need to consider that spatial autocorrelation and spatial heterogeneity exist in geographic datasets such as housing transactions. In the presence of spatial autocorrelation, OLS estimation of predictive housing resale pricing models can produce biased, inconsistent, or inefficient results (Anselin 1998).
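Once an OLS model is calibrated later, this concern can be checked empirically with a Moran's I test on the model residuals. A minimal sketch with the spdep package, assuming `mlr` is a fitted lm model and `coords` is a matrix of the corresponding transactions' projected coordinates (both hypothetical names here):

```r
library(spdep)

# Spatial weights list built from the 6 nearest neighbours of each transaction
nb <- knn2nb(knearneigh(coords, k = 6))
lw <- nb2listw(nb, style = "W")

# Moran's I test on the OLS residuals; a significant positive statistic
# signals spatial autocorrelation that the aspatial model has not captured
lm.morantest(mlr, lw)
```

A significant result would support moving beyond OLS to the geographically weighted methods used later.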

Objectives

Using multiple linear regression and geographically weighted models, I would like to:
- Predict housing prices based on structural and locational factors.
- Incorporate geographical factors to enhance the predictive model.
- Compare the resulting models.

Data Sources

(saved under ‘data’ folder)

1 Setting Up

1.1 Loading R Packages

I will be using the following R packages:
- sf package to perform geospatial wrangling tasks
- tidyverse package for reading csv files and dataframe processing tasks
- matrixStats package for matrix processing
- ggplot2, ggpubr and ggstatsplot packages for plotting statistical graphics
- heatmaply package for attribute standardisation
- gtsummary package to create regression reports
- olsrr package to perform diagnostic tests
- Metrics and performance packages to evaluate model performance
- DT package for visualizing results in a table format
- spdep package for performing spatial autocorrelation tests on residuals
- GWmodel package to compute the optimal adaptive bandwidth
- SpatialML package for calibrating the geographical random forest model
- httr, jsonlite and rvest packages for getting latitude and longitude data
- tmap package for plotting tasks

pacman::p_load(sf, tidyverse, matrixStats, ggplot2, ggpubr, ggstatsplot, heatmaply,
               gtsummary, olsrr, Metrics, performance, DT, spdep,
               GWmodel, SpatialML, httr, jsonlite, rvest, tmap)

2 Procuring Dataset

Note

The steps used in this section are largely adapted from the Take-home_Ex3b-HDBDataPrep provided by Prof Kam, along with other data preprocessing steps by Hao Xian, Wen Yang and Pierre Jean Michel.

The goal of this section is to supplement the resale flat price data from data.gov.sg with other geospatial and locational factors. While my seniors have provided useful links on where that information can be obtained, I'll cross-verify them with other, more up-to-date sources where possible. A quick check also showed that some sources are no longer available through the OneMap API, so additional effort is required to obtain that information.

| Factor | Type | Source |
|---|---|---|
| Proximity to CBD | Location | - |
| Proximity to Eldercare | Location | OneMap API, data.gov.sg |
| Proximity to Foodcourt/Hawker Centres | Location | data.gov.sg |
| Proximity to MRT | Location | LTA Data Mall |
| Proximity to Park | Location | OneMap API, data.gov.sg |
| Proximity to Good Primary School | Location | OneMap API (list from MOE Website) |
| Proximity to Shopping Mall | Location | OpenStreetMap, OneMap API (Wikipedia) |
| Proximity to Supermarket | Location | data.gov.sg |
| No. of Kindergartens within 350m | Location | OneMap API, data.gov.sg |
| No. of Childcare Centres within 350m | Location | OneMap API |
| No. of Bus Stops within 350m | Location | LTA Data Mall |
| No. of Primary Schools within 1km | Location | OneMap API (list from MOE Website) |
| Main Upgrading Program | Structural | HDB Website |
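All the proximity factors above are derived the same way once each amenity layer is in sf form: a nearest-distance computation for the "Proximity to" factors and a radius count for the "No. of ... within" factors. A minimal sketch with sf, assuming `resale_sf` and a generic `amenity_sf` layer share the same projected CRS (EPSG:3414, so distances are in metres):

```r
library(sf)

# Distance (in metres) from each flat to its nearest amenity
PROX <- apply(st_distance(resale_sf, amenity_sf), 1, min)

# Number of amenities within a 350m radius of each flat
WITHIN_350M <- lengths(st_is_within_distance(resale_sf, amenity_sf, dist = 350))
```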

2.1 Geospatial Data

Before proceeding to get the various locational and structural data, I will first get the geospatial data for the relevant resale flats of interest. This process makes use of the OneMap API, making calls to retrieve latitude and longitude information given a specific address.

To do so, I will need to filter the dataset to transactions between Jan 2023 and Sep 2024 after loading it with read_csv() from the readr package. Additional pre-processing steps are also done to create the address field and the remaining lease year and month fields.

# Load resale data
resale <- read_csv("data/aspatial/resale.csv") %>%
  filter(month >= "2023-01" & month <= "2024-09")
Rows: 192234 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Preprocess data
resale_tidy <- resale %>%
  mutate(address = paste(block,street_name)) %>%
  mutate(remaining_lease_yr = as.integer(
    str_sub(remaining_lease, 0, 2)))%>%
  mutate(remaining_lease_mth = as.integer(
    str_sub(remaining_lease, 9, 11)))

A unique list of the addresses is obtained to minimize making duplicate API calls to the same address.

# Unique list 
add_list <- sort(unique(resale_tidy$address))

# Unique address by separate columns
unique_address <- resale_tidy %>%
  distinct(town, street_name, block)

# Save unique address
write_csv(unique_address,"data/aspatial/unique_address.csv")

The function to retrieve the coordinates of the location given a specific address is as follows:

# Function to retrieve coordinates
get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, 
                            postal = postal, 
                            latitude = lat, 
                            longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, 
                                postal = NA, 
                                latitude = NA, 
                                longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, 
                              postal = postal, 
                              latitude = lat, 
                              longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, 
                            postal = NA, 
                            latitude = NA, 
                            longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

After getting the coordinates, they are saved in rds format for easy retrieval.

# Get coordinates of hdb addresses
coords <- get_coords(add_list)

# Save coordinates as rds
write_rds(coords, "data/processed/coords_full.rds")

The coordinates will be joined with the resale_tidy dataset. A quick check is needed to ensure that all coordinates retrieved by this process have valid latitudes and longitudes. While there are a couple of addresses with no postal code, I can proceed since they still have latitude and longitude coordinates.

# Load coordinates data
coords <- read_rds("data/processed/coords_full.rds")

# Check coordinates data 
coords[(is.na(coords$postal) | is.na(coords$latitude) | is.na(coords$longitude) | coords$postal=="NIL"), ]
                    address postal         latitude        longitude
2129 215 CHOA CHU KANG CTRL    NIL 1.38308302434129 103.747076627693
2141 216 CHOA CHU KANG CTRL    NIL 1.38308302434129 103.747076627693
# Join coordinates to resale_tidy dataset
resale_sf <- left_join(resale_tidy, coords, by = c('address' = 'address')) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs=3414)

2.2 Locational Data

Using httr and jsonlite, I will first get the full list of available themes in the OneMap API. The token can be requested from the OneMap website.

token <- 'your_token_here'
# Url to get Theme Info
url <- "https://www.onemap.gov.sg/api/public/themesvc/getAllThemesInfo?moreInfo=Y"

# Update query params and header
headers <- httr::add_headers(
  Authorization = token
)

# Make the GET request
response <- httr::GET(url, headers)

# Extract the information
content <- content(response, "text")
theme_df <- as.data.frame(fromJSON(content))

# Save theme as rds
write_rds(theme_df, "data/processed/theme.rds")

The Master Plan Subzone boundary data (2019) is also loaded with st_read() from the sf package and projected accordingly with st_transform(). This will be important for the visualisation of the different geospatial datasets in the subsequent steps.

mpsz19_shp <- st_read(dsn = "data/geospatial/MPSZ-2019/", layer = "MPSZ-2019") %>% 
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\MPSZ-2019' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
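Note that tmap later warns that this layer contains invalid geometries ("The shape mpsz19_shp is invalid"). Instead of relying on tmap_options(check.and.fix = TRUE) for every plot, the polygons could be repaired once at load time with st_make_valid() from the sf package; a sketch:

```r
# Repair any invalid (e.g. self-intersecting) polygons once, after loading
mpsz19_shp <- st_make_valid(mpsz19_shp)

# Confirm that all features are now valid
all(st_is_valid(mpsz19_shp))
```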

2.2.1 CBD

The central business district (CBD) is an important region in Singapore, since it is where many businesses reside. This matters particularly to working adults, as an HDB flat located closer to the CBD could mean a shorter travelling time between workplace and home. While various other business districts have emerged in recent years (East: Changi Business Park, Tampines, Paya Lebar; West: One North Business Park), this exercise will only consider the Central Business District.

Google defines the CBD to be at latitude 1.287953, longitude 103.851784. I will save this as an sf dataframe with st_as_sf() from the sf package, and then save it as rds.

# Cbd lat lon data
cbd_lat <- 1.287953
cbd_lon <- 103.851784

# Create cbd_sf dataframe
cbd_sf <- data.frame(cbd_lat, cbd_lon) %>%
  st_as_sf(coords = c("cbd_lon", "cbd_lat"), crs = 4326) %>%
  st_transform(crs=3414)

# Save as rds
write_rds(cbd_sf, "data/processed/cbd.rds")

2.2.2 Eldercare

From the themes, I can see that the query parameter for Eldercare is “eldercare”. After retrieving the data with a GET request, I will save it as an rds file for easy retrieval.

base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "eldercare"

# Update query params and header
headers <- httr::add_headers(
  Authorization = token
)

url <- paste0(base_url,parameter)

# Make the GET request
response <- httr::GET(url, headers)

# Extract the information
content <- content(response, "text")
eldercare_df <- as.data.frame(fromJSON(content))

# Save eldercare_df
write_rds(eldercare_df, "data/processed/eldercare_df.rds")

Now, I can load the data and check it.

# Load eldercare_df
eldercare_df <- read_rds("data/processed/eldercare_df.rds")

# Check output
datatable(head(eldercare_df), rownames=FALSE, width="100%",
            options=list(scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))

After looking at the eldercare dataframe, it is clear that I will need to do some additional processing:

  1. Separate the single LatLng column into separate lat and lon columns
  2. Select only the columns I need, removing the first row (which has no NAME) with a filter condition
  3. Convert it into an sf dataframe with st_as_sf() from the sf package
  4. Apply the appropriate projection with st_transform()

Once done, I will save this as an rds file.

# Rename colnames
colnames(eldercare_df) <- gsub("^SrchResults\\.", "", colnames(eldercare_df))

# Process sf dataframe
eldercare_sf <- eldercare_df %>%
  separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
  select(NAME, ADDRESSPOSTALCODE, ADDRESSSTREETNAME, Type, lat, lon) %>%
  filter(!is.na(NAME)) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(crs=3414)


# Save sf dataframe
write_rds(eldercare_sf, "data/processed/eldercare_sf.rds")

Since there is also an eldercare dataset on data.gov.sg, I will load it in along with the previously saved eldercare sf dataframe.

# Load Eldercare Data.gov.sg dataset 
eldercare_dg_sf <- st_read("data/geospatial/EldercareServicesSHP", layer = "ELDERCARE") %>%
  st_transform(crs = 3414)
Reading layer `ELDERCARE' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\EldercareServicesSHP' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
# Load Saved OneMap API data
eldercare_sf <- read_rds("data/processed/eldercare_sf.rds")

By visualising both datasets side by side with tmap_arrange() from the tmap package, I can better see if there are any discrepancies, to decide which dataset to use and whether any additional processing is required.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
# Calculate row counts
num_eldercare_sf <- nrow(eldercare_sf)
num_eldercare_dg_sf <- nrow(eldercare_dg_sf)


tmap_options(check.and.fix = TRUE)

# Visualize eldercare from OneMap API
map_onemap <- tm_shape(mpsz19_shp) +
  tm_polygons() +
  tm_shape(eldercare_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = paste("Eldercare (OneMap API) - Count:", num_eldercare_sf),
            main.title.position = ("center"))


# Visualize eldercare from Data.gov.sg
map_data_gov <- tm_shape(mpsz19_shp) +
  tm_polygons() +
  tm_shape(eldercare_dg_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = paste("Eldercare (Data Gov) - Count:", num_eldercare_dg_sf),
            main.title.position = ("center"))

# Visualize both side by side
tmap_arrange(map_onemap, map_data_gov, ncol = 2, nrow = 1)
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Since the counts for both are the same and there do not seem to be any noticeable discrepancies between them, I can proceed to use either of them in the subsequent steps.

2.2.3 Foodcourt/Hawker Centres

The hawker centre dataset from data.gov.sg will be loaded in with st_read() and projected with st_transform() from the sf package.

# Load Hawker Data.gov.sg dataset
hawker_sf <- st_read("data/geospatial/HawkerCentresKML.kml") %>%
  st_transform(crs = 3414)
Reading layer `HAWKERCENTRE' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\HawkerCentresKML.kml' 
  using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

As with other datasets, I will visualize it with tmap to ensure that it is fine to use.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize hawker centres
tm_shape(mpsz19_shp) +
  tm_polygons() + 
tm_shape(hawker_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = "Hawker Centre (Data Gov)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.

2.2.4 MRT

The MRT dataset will be loaded in from the TrainStationExit dataset in LTA Datamall with st_read() and projected with st_transform() from the sf package.

# Load train station exit data
train_sf <- st_read(dsn = "data/geospatial/TrainStationExit", layer = "Train_Station_Exit_Layer") %>%
  st_transform(crs = 3414)
Reading layer `Train_Station_Exit_Layer' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\TrainStationExit' 
  using driver `ESRI Shapefile'
Simple feature collection with 593 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21

As with other datasets, I will visualize it with tmap to ensure that it is fine to use.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize train station exits
tm_shape(mpsz19_shp) +
  tm_polygons() + 
tm_shape(train_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = "Train Station Exits (LTA DataMall)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.

2.2.5 Park

From the themes, I noticed that there are a few candidate query parameters for parks: “nationalparks”, “nparks_parks” and “nr_gaz_2005”. After some trial and error, I decided to go with “nationalparks” and retrieved the data with a GET request. Following that, I will save it as an rds file for easy retrieval.

base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "nationalparks"

# Update query params and header
headers <- httr::add_headers(
  Authorization = token
)

url <- paste0(base_url,parameter)

# Make the GET request
response <- httr::GET(url, headers)

# Extract the information
content <- content(response, "text")
parks_df <- as.data.frame(fromJSON(content))

# Save parks_df
write_rds(parks_df, "data/processed/parks_df.rds")

Now, I can load the data and check it.

# Load parks_df
parks_df <- read_rds("data/processed/parks_df.rds")

# Check output
datatable(head(parks_df), rownames=FALSE, width="100%",
            options=list(scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))

After looking at the parks dataframe, it is clear that I will need to do some additional processing, similar to the eldercare data:

  1. Separate the single LatLng column into separate lat and lon columns
  2. Select only the columns I need, removing the first row (which has no NAME) with a filter condition
  3. Convert it into an sf dataframe with st_as_sf() from the sf package
  4. Apply the appropriate projection with st_transform()

Once done, I will save this as an rds file.

# Rename colnames
colnames(parks_df) <- gsub("^SrchResults\\.", "", colnames(parks_df))

# Process sf dataframe
parks_sf <- parks_df %>%
  separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
  select(NAME, Type, lat, lon) %>%
  filter(!is.na(NAME)) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(crs=3414)

# Save sf dataframe
write_rds(parks_sf, "data/processed/parks_sf.rds")

Since there is also a parks dataset on data.gov.sg, I will load it in along with the previously saved parks sf dataframe.

# Load data.gov.sg parks data
parks_dg_sf <- st_read("data/geospatial/Parks.kml") %>%
  st_transform(crs = 3414)
Reading layer `NATIONALPARKS' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\Parks.kml' 
  using driver `KML'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
# Load Saved OneMap API data
parks_sf <- read_rds("data/processed/parks_sf.rds")

As with eldercare, I will visualise both datasets side by side with tmap_arrange() from the tmap package, to better see if there are any discrepancies, to decide which dataset to use and whether any additional processing is required.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
# Calculate row counts
num_parks_sf <- nrow(parks_sf)
num_parks_dg_sf <- nrow(parks_dg_sf)

tmap_options(check.and.fix = TRUE)

# Visualize parks from OneMap API
map_onemap <- tm_shape(mpsz19_shp) +
  tm_polygons() +
  tm_shape(parks_sf) +
  tm_dots(col='red', size = 0.1) +
  tm_layout(main.title = paste("Parks (OneMap API) - Count:", num_parks_sf), 
            main.title.position = ("center"))


# Visualize parks from Data.gov.sg
map_data_gov <- tm_shape(mpsz19_shp) +
  tm_polygons() +
  tm_shape(parks_dg_sf) +
  tm_dots(col='red', size = 0.1) +
  tm_layout(main.title = paste("Parks (Data Gov) - Count:", num_parks_dg_sf), 
            main.title.position = ("center"))

# Visualize both side by side
tmap_arrange(map_onemap, map_data_gov, ncol = 2, nrow = 1)
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

The OneMap API dataset contains more parks than the data.gov.sg one. The data.gov.sg dataset also has one park suspiciously south of Singapore, in the middle of two islands. Therefore, moving forward I will use the OneMap API dataset for parks. However, parks in the “North-Eastern Islands” subzone will be removed, as I don’t think they are relevant when considering proximity to HDBs. Once done, I will save this as an updated rds file.

# Remove North-Eastern Islands parks
mpsz19_shp_filtered <- mpsz19_shp %>%
  filter(SUBZONE_N != "NORTH-EASTERN ISLANDS")

# Filter parks that are within filtered mpsz19
parks_updated_sf <- st_filter(parks_sf, mpsz19_shp_filtered, .predicate = st_within)

Now I will visualize it again to confirm the removal.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize updated parks
tm_shape(mpsz19_shp) +
  tm_polygons() + 
tm_shape(parks_updated_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = "Parks (OneMap API) Updated", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Now, I can save it to be used in the subsequent steps.

# Save updated park sf dataframe as rds
write_rds(parks_updated_sf, "data/processed/parks_updated_sf.rds")

2.2.6 Primary School

The list of primary schools is obtained from the MOE website.

primary_school_list <- c("Admiralty Primary School","Ahmad Ibrahim Primary School","Ai Tong School","Alexandra Primary School",
                         "Anchor Green Primary School","Anderson Primary School","Ang Mo Kio Primary School","Anglo-Chinese School (Junior)",
                         "Anglo-Chinese School (Primary)","Angsana Primary School","Beacon Primary School","Bedok Green Primary School",
                         "Bendemeer Primary School","Blangah Rise Primary School","Boon Lay Garden Primary School","Bukit Panjang Primary School",
                         "Bukit Timah Primary School","Bukit View Primary School","Canberra Primary School","Canossa Catholic Primary School",
                         "Cantonment Primary School","Casuarina Primary School","Catholic High School (Primary)","Cedar Primary School",
                         "Changkat Primary School","CHIJ (Katong) Primary","CHIJ (Kellock)","CHIJ Our Lady of Good Counsel",
                         "CHIJ Our Lady of the Nativity","CHIJ Our Lady Queen of Peace","CHIJ Primary (Toa Payoh)",
                         "CHIJ St. Nicholas Girls' School (Primary Section)","Chongfu School","Chongzheng Primary School",
                         "Chua Chu Kang Primary School","Clementi Primary School","Compassvale Primary School","Concord Primary School",
                         "Corporation Primary School","Damai Primary School","Dazhong Primary School","De La Salle School",
                         "East Spring Primary School","Edgefield Primary School","Elias Park Primary School","Endeavour Primary School",
                         "Evergreen Primary School","Fairfield Methodist School (Primary)","Farrer Park Primary School",
                         "Fengshan Primary School","Fern Green Primary School","Fernvale Primary School","First Toa Payoh Primary School",
                         "Frontier Primary School","Fuchun Primary School","Fuhua Primary School","Gan Eng Seng Primary School",
                         "Geylang Methodist School (Primary)","Gongshang Primary School","Greendale Primary School","Greenridge Primary School",
                         "Greenwood Primary School","Haig Girls' School","Henry Park Primary School","Holy Innocents' Primary School",
                         "Hong Wen School","Horizon Primary School","Hougang Primary School","Huamin Primary School","Innova Primary School",
                         "Jiemin Primary School","Jing Shan Primary School","Junyuan Primary School","Jurong Primary School",
                         "Jurong West Primary School","Keming Primary School","Kheng Cheng School","Kong Hwa School","Kranji Primary School",
                         "Kuo Chuan Presbyterian Primary School","Lakeside Primary School","Lianhua Primary School","Maha Bodhi School",
                         "Maris Stella High School (Primary Section)","Marsiling Primary School","Marymount Convent School",
                         "Mayflower Primary School","Mee Toh School","Meridian Primary School","Methodist Girls' School (Primary)",
                         "Montfort Junior School","Nan Chiau Primary School","Nan Hua Primary School","Nanyang Primary School",
                         "Naval Base Primary School","New Town Primary School","Ngee Ann Primary School","North Spring Primary School",
                         "North View Primary School","North Vista Primary School","Northland Primary School","Northoaks Primary School",
                         "Northshore Primary School","Oasis Primary School","Opera Estate Primary School","Palm View Primary School",
                         "Park View Primary School","Pasir Ris Primary School","Paya Lebar Methodist Girls' School (Primary)",
                         "Pei Chun Public School","Pei Hwa Presbyterian Primary School","Pei Tong Primary School","Peiying Primary School",
                         "Pioneer Primary School","Poi Ching School","Princess Elizabeth Primary School","Punggol Cove Primary School",
                         "Punggol Green Primary School","Punggol Primary School","Punggol View Primary School","Qifa Primary School",
                         "Qihua Primary School","Queenstown Primary School","Radin Mas Primary School","Raffles Girls' Primary School",
                         "Red Swastika School","River Valley Primary School","Riverside Primary School","Rivervale Primary School",
                         "Rosyth School","Rulang Primary School","Sembawang Primary School","Seng Kang Primary School",
                         "Sengkang Green Primary School","Shuqun Primary School","Si Ling Primary School",
                         "Singapore Chinese Girls’ School (Primary)","South View Primary School","Springdale Primary School",
                         "St. Andrew's Junior School","St. Anthony's Canossian Primary School","St. Anthony's Primary School",
                         "St. Gabriel's Primary School","St. Hilda's Primary School","St. Joseph's Institution Junior",
                         "St. Margaret's Primary School","St. Stephen's School","Tampines North Primary School","Tampines Primary School",
                         "Tanjong Katong Primary School","Tao Nan School","Teck Ghee Primary School","Teck Whye Primary School",
                         "Telok Kurau Primary School","Temasek Primary School","Townsville Primary School","Unity Primary School",
                         "Valour Primary School","Waterway Primary School","Wellington Primary School","West Grove Primary School",
                         "West Spring Primary School","West View Primary School","Westwood Primary School","White Sands Primary School",
                         "Woodgrove Primary School","Woodlands Primary School","Woodlands Ring Primary School","Xinghua Primary School",
                         "Xingnan Primary School","Xinmin Primary School","Xishan Primary School","Yangzheng Primary School",
                         "Yew Tee Primary School","Yio Chu Kang Primary School","Yishun Primary School","Yu Neng Primary School",
                         "Yuhua Primary School","Yumin Primary School","Zhangde Primary School","Zhenghua Primary School","Zhonghua Primary School"
                         )

For each of the primary schools, I will reuse the get_coords function to retrieve the coordinates. These will then be saved as rds for easy retrieval.

# Get coordinates of primary schools
primary_school_coords <- get_coords(primary_school_list)

# Save coordinates as rds
write_rds(primary_school_coords, "data/processed/primary_school_coords.rds")

Now I can load in the primary school coordinates and check if there is any missing data.

# Load primary school coordinates
primary_school_coords <- read_rds("data/processed/primary_school_coords.rds")
  
# Check coordinates data 
primary_school_coords[(is.na(primary_school_coords$postal) | is.na(primary_school_coords$latitude) | is.na(primary_school_coords$longitude) | primary_school_coords$postal=="NIL"), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)

Now that the coordinates are fine, I can proceed to convert them into an sf dataframe with st_as_sf() and apply the appropriate projection with st_transform(). The result will then be saved as an rds file.

# Convert primary_school_coords to sf
primary_sch_sf <- primary_school_coords %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs=3414) 


# Save primary_sch_sf as rds
write_rds(primary_sch_sf, "data/processed/primary_sch_sf.rds")

Now I can load the rds file back.

# Load primary school rds
primary_sch_sf <- read_rds("data/processed/primary_sch_sf.rds")

As with other datasets, I will visualize it with tmap to ensure that it is fine to use.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize primary schools
tm_shape(mpsz19_shp) +
  tm_polygons() + 
tm_shape(primary_sch_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = "Primary Schools (OpenMap API)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there do not seem to be any anomalies with the points, so I can proceed to use this dataset in the subsequent steps. However, one further step is required since we want to identify “good” primary schools. Although my seniors used schlah, I decided to go with outofbox after cross-referencing its list with others (e.g. propertyguru, asipretutor, etc.). While outofbox does not have a sophisticated model like schlah, I find that demand measured through oversubscription is the purest signal of demand. From this list, I will use the top 20, which is roughly 10% of the primary schools in Singapore.

After filtering, this will be saved to be used in the later sections.

# List of good primary schools
good_primary_school_list <- c("Methodist Girls' School (Primary)", "Tao Nan School", "Ai Tong School", 
                              "Holy Innocents' Primary School", "CHIJ St. Nicholas Girls' School (Primary Section)", 
                              "Admiralty Primary School", "St. Joseph's Institution Junior", "Catholic High School (Primary)",
                              "Anglo-Chinese School (Junior)", "Chongfu School", "Kong Hwa School", "St. Hilda's Primary School", 
                              "Anglo-Chinese School (Primary)", "Nan Chiau Primary School", "Nan Hua Primary School", 
                              "Nanyang Primary School", "Pei Hwa Presbyterian Primary School", "Kuo Chuan Presbyterian Primary School",
                              "Rulang Primary School", "Singapore Chinese Girls’ School (Primary)")

# Filter primary school
gd_primary_sch_sf <- primary_sch_sf %>%
  filter(address %in% good_primary_school_list)

# Save to rds
write_rds(gd_primary_sch_sf, "data/processed/gd_primary_sch_sf.rds")

2.2.7 Shopping Mall

The shopping mall data is the trickiest of all the datasets. While there are many lists online, I chanced upon the OpenStreetMap API via overpass turbo. Since shopping malls on OpenStreetMap carry the tag “shop”=“mall”, a query can be run to retrieve all shopping malls in Singapore.

Note

Besides this, I have explored the following:

  • There is also an osmdata package. While it can also extract OpenStreetMap data, the resulting data is counterintuitive (i.e. it contains data from Johor Bahru), and it splits the output into separate points and polygons.
  • OpenStreetMap data from HDX containing buildings in Singapore. After working with it, I found that filtering buildings by “shop” = “mall” does not return many results. This may be because shopping malls are not listed as buildings.
  1. Go to the overpass-turbo website
  2. Run the following query:
  3. After generating the results, they can be exported as KML and saved into the data folder.
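The query itself was not captured in this export. As an illustration only (the exact query used may have differed), a minimal Overpass QL query matching the description above — all features tagged “shop”=“mall” within Singapore — could look like:

```
[out:json][timeout:60];
// overpass-turbo resolves this shorthand to the Singapore boundary
{{geocodeArea:Singapore}}->.sg;
(
  node["shop"="mall"](area.sg);
  way["shop"="mall"](area.sg);
  relation["shop"="mall"](area.sg);
);
// "out center" collapses ways/relations to a single representative point
out center;
```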

I will load in the shopping mall data from OpenStreetMap with st_read() and do some basic processing:

  • Projection with st_transform() from sf package
  • Filter out data that have no name
  • Use st_point_on_surface() to convert all “POLYGON” (if any) into “POINT”.
# Load Osm shopping mall data
shoppingmall_sf <- st_read("data/geospatial/Shopping Mall.kml") %>%
  st_transform(crs = 3414) %>%
  filter(Name != "" & !is.na(Name)) %>%
  st_point_on_surface()
Reading layer `overpass-turbo.eu export' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\Shopping Mall.kml' 
  using driver `KML'
Simple feature collection with 255 features and 2 fields
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: 103.6777 ymin: 1.2466 xmax: 103.9945 ymax: 1.448547
Geodetic CRS:  WGS 84
Warning: st_point_on_surface assumes attributes are constant over geometries

I will use the list of shopping malls from Wikipedia to cross-reference the two lists of malls.

# Mall List from Wiki
mall_wiki_list <- c("100 AM","313@Somerset","Aperia","Balestier Hill Shopping Centre","Bugis Cube","Bugis Junction",
                      "Bugis+","Capitol Piazza","Cathay Cineleisure Orchard","Clarke Quay Central","The Centrepoint",
                      "City Square Mall","City Gate Mall","CityLink Mall","Duo","Far East Plaza","Funan","Great World City",
                      "HDB Hub","Holland Village Shopping Mall","ION Orchard","Junction 8","Knightsbridge","Liat Towers",
                      "Lucky Plaza","Marina Bay Sands","The Shoppes at Marina Bay Sands","Marina Bay Link Mall",
                      "Marina Square","Millenia Walk","Mustafa Shopping Centre","Ngee Ann City","One Holland Village",
                      "Orchard Central","Orchard Gateway","Orchard Plaza","Midpoint Orchard","Palais Renaissance",
                      "People's Park Centre","People's Park Complex","Plaza Singapura","GRiD(pomo)","Raffles City",
                      "Scotts Square","Shaw House and Centre","Sim Lim Square","Singapore Shopping Centre","The South Beach",
                      "Square 2","Sunshine Plaza","Suntec City","Tanglin Mall","Tanjong Pagar Centre","Tekka Centre",
                      "The Adelphi","The Paragon","Tiong Bahru Plaza","The Poiz","Thomson Plaza","United Square","Thomson V",
                      "Velocity@Novena Square","Wheelock Place","Wisma Atria","Zhongshan Mall","Bedok Mall","Century Square",
                      "City Plaza","Changi City Point","Downtown East","Djitsun Mall Bedok","Eastpoint Mall","Jewel Changi Airport",
                      "KINEX (formerly OneKM)","Katong Shopping Centre","Katong Square","Kallang Wave Mall","Leisure Park Kallang",
                      "i12 Katong","Our Tampines Hub","Parkway Parade","Pasir Ris Mall","Pasir Ris West Plaza","Paya Lebar Square",
                      "Paya Lebar Quarter (PLQ)","Roxy Square","Singpost Centre","Tampines 1","Tampines Mall","White Sands",
                      "Elias Mall","Loyang Point","888 Plaza","Admiralty Place","AMK Hub","Canberra Plaza","Causeway Point",
                      "Broadway Plaza","Jubilee Square","Junction Nine","Marsiling Mall","Northpoint City","Sembawang Shopping Centre",
                      "Sun Plaza","Vista Point","Wisteria Mall","Woodlands Civic Centre","Woodlands Mart","Woodlands North Plaza",
                      "Anchorvale Village","Buangkok Square","Compass One","Greenwich V","Heartland Mall","Hougang 1",
                      "Hougang Green Shopping Mall","Hougang Mall","NEX","Northshore Plaza","Oasis Terraces","Punggol Coast Mall",
                      "Punggol Plaza","Rivervale Mall","Rivervale Plaza","Sengkang Grand Mall","The Seletar Mall",
                      "Upper Serangoon Shopping Centre","Waterway Point","myVillage At Serangoon Garden","Beauty World Centre",
                      "Beauty World Plaza","Bukit Panjang Plaza","Bukit Timah Plaza","Fajar Shopping Centre",
                      "Greenridge Shopping Centre","Hillion Mall","HillV2","Junction 10","Keat Hong Shopping Centre",
                      "Limbang Shopping Centre","Lot One","Rail Mall","Sunshine Place","Teck Whye Shopping Centre","West Mall",
                      "Yew Tee Point","Yew Tee Square","VivoCity","HarbourFront Centre","Alexandra Retail Centre","321 Clementi",
                      "The Clementi Mall","IMM","Jem","Westgate","Jurong Point","Pioneer Mall","The Star Vista","Alexandra Central",
                      "Anchorpoint","OD Mall","Boon Lay Shopping Centre","Grantral Mall","Fairprice Hub","Gek Poh Shopping Centre",
                      "Rochester Mall","Taman Jurong Shopping Centre","West Coast Plaza","Plantation Plaza","Queensway Shopping Centre","The Rail Mall"
                      )

It is always good to visualize the data to check for anomalies.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize shopping malls
tm_shape(mpsz19_shp) +
  tm_polygons() + 
tm_shape(shoppingmall_sf) +
  tm_dots(col='red', size = 0.1) +
  tm_layout(main.title = "Shopping Malls (OpenStreetMap)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Using setdiff(), I will compare both lists to see which malls appear in one list but not the other.

# Only in wiki
only_in_wiki <- setdiff(mall_wiki_list, shoppingmall_sf$Name)
sort(only_in_wiki)
 [1] "100 AM"                         "321 Clementi"                  
 [3] "AMK Hub"                        "Anchorvale Village"            
 [5] "Aperia"                         "Balestier Hill Shopping Centre"
 [7] "Buangkok Square"                "Bugis Cube"                    
 [9] "Capitol Piazza"                 "City Gate Mall"                
[11] "Duo"                            "Eastpoint Mall"                
[13] "Grantral Mall"                  "GRiD(pomo)"                    
[15] "HarbourFront Centre"            "HillV2"                        
[17] "Holland Village Shopping Mall"  "Jem"                           
[19] "KINEX (formerly OneKM)"         "Lot One"                       
[21] "Marina Bay Sands"               "Marsiling Mall"                
[23] "Mustafa Shopping Centre"        "myVillage At Serangoon Garden" 
[25] "NEX"                            "Northshore Plaza"              
[27] "Oasis Terraces"                 "OD Mall"                       
[29] "Our Tampines Hub"               "Pasir Ris West Plaza"          
[31] "Paya Lebar Quarter (PLQ)"       "People's Park Centre"          
[33] "People's Park Complex"          "Punggol Coast Mall"            
[35] "Raffles City"                   "Rail Mall"                     
[37] "Shaw House and Centre"          "Singapore Shopping Centre"     
[39] "Singpost Centre"                "Tanjong Pagar Centre"          
[41] "Teck Whye Shopping Centre"      "The Adelphi"                   
[43] "The Paragon"                    "The Poiz"                      
[45] "The Rail Mall"                  "The Seletar Mall"              
[47] "The South Beach"                "Thomson V"                     
[49] "Velocity@Novena Square"         "Woodlands Civic Centre"        
[51] "Woodlands Mart"                 "Woodlands North Plaza"         
# Only in osm
only_in_osm <- setdiff(shoppingmall_sf$Name, mall_wiki_list)
sort(only_in_osm)
  [1] "100AM"                                       
  [2] "18 Tai Seng"                                 
  [3] "Ang Mo Kio Hub"                              
  [4] "Aperia Mall"                                 
  [5] "Balestier Plaza"                             
  [6] "Balmoral Plaza"                              
  [7] "Bencoolen"                                   
  [8] "Bencoolen Underground Mall"                  
  [9] "Big Block"                                   
 [10] "Bras Basah Complex"                          
 [11] "Buangkok Square Mall"                        
 [12] "Bukit Timah Shopping Center"                 
 [13] "Changi Airport Terminal 1"                   
 [14] "Changi Airport Terminal 2"                   
 [15] "Changi Airport Terminal 3"                   
 [16] "Clementi 321"                                
 [17] "Cluny Court"                                 
 [18] "Concorde Shopping Centre"                    
 [19] "Coronation Plaza"                            
 [20] "Dairy Farm Mall"                             
 [21] "Dawson Place"                                
 [22] "Delfi Orchard"                               
 [23] "Depot Heights Shopping Centre"               
 [24] "Design Orchard"                              
 [25] "Downtown Gallery"                            
 [26] "Duo Residences"                              
 [27] "East Point"                                  
 [28] "Eastlink Mall"                               
 [29] "Esplanade Xchange"                           
 [30] "Excelsior Shopping Centre"                   
 [31] "Far East Shopping Centre"                    
 [32] "Fortune Centre"                              
 [33] "Forum The Shopping Mall"                     
 [34] "Fusionopolis 1"                              
 [35] "GR.ID"                                       
 [36] "Grandlink Square"                            
 [37] "Grantral Mall @ Clementi"                    
 [38] "Guoco Tower"                                 
 [39] "HillV2 Mall"                                 
 [40] "Icon Village"                                
 [41] "Jalan Besar Plaza"                           
 [42] "JEM"                                         
 [43] "Jewel Changi Basement"                       
 [44] "Joo Chiat Complex"                           
 [45] "Kang Kar Mall"                               
 [46] "Katong Plaza"                                
 [47] "Katong V"                                    
 [48] "Kinex"                                       
 [49] "Le Quest Mall"                               
 [50] "Liat Tower"                                  
 [51] "Liv@Changi"                                  
 [52] "Lot One Shoppers' Mall"                      
 [53] "MacPherson Mall"                             
 [54] "Mandarin Gallery"                            
 [55] "Marina One"                                  
 [56] "Marsiling Mall Hawker Centre"                
 [57] "Mustafa Centre"                              
 [58] "myVillage"                                   
 [59] "nex"                                         
 [60] "Northshore Plaza I"                          
 [61] "Northshore Plaza II"                         
 [62] "Odeon Katong Shopping Complex"               
 [63] "Old Airport Road Food Centre & Shopping Mall"
 [64] "Orchard Building"                            
 [65] "Orchard Gateway @ Emerald"                   
 [66] "Orchard Shopping Centre"                     
 [67] "Orchard Towers"                              
 [68] "OUE Downtown"                                
 [69] "Pacific Plaza"                               
 [70] "Paragon"                                     
 [71] "Parklane Shopping Mall"                      
 [72] "Peninsula Plaza"                             
 [73] "Plaza 8"                                     
 [74] "Quayside Isle"                               
 [75] "Raffles City Shopping Centre"                
 [76] "Royal Square at Novena"                      
 [77] "Seletar Mall"                                
 [78] "Shaw House"                                  
 [79] "Shaw Plaza"                                  
 [80] "Simei Plaza /Blk 248"                        
 [81] "SingPost Centre"                             
 [82] "South Beach Avenue"                          
 [83] "South Beach Quarter"                         
 [84] "STELLAR@TE2"                                 
 [85] "Tampines Mart"                               
 [86] "Tang Plaza"                                  
 [87] "Tanglin Shopping Centre"                     
 [88] "Tanjong Katong Complex"                      
 [89] "Tekka Place"                                 
 [90] "The Cathay"                                  
 [91] "The Centris"                                 
 [92] "The Concourse"                               
 [93] "The Flow Mall"                               
 [94] "The Heeren"                                  
 [95] "The Oasis"                                   
 [96] "The Poiz Centre"                             
 [97] "The Woodgrove"                               
 [98] "The Woodleigh Mall"                          
 [99] "Trio"                                        
[100] "Valley Point Shopping Centre"                
[101] "Velocity @ Novena Square"                    

Comparing both lists, I found a number of malls that are labelled differently in the two sources. This should not be an issue. However, some malls in the Wikipedia list are missing from the OpenStreetMap list.

After further investigation, I compiled the list of missing malls. There are also malls listed on Wikipedia that I found not relevant to the current exercise:

  • OD Mall: Removed
  • Our Tampines Hub: Not a mall
  • Punggol Coast Mall: Still under construction
  • Thomson V: Not a mall
# Compile list of missing malls
missing_malls <- c("Anchorvale Village", "Balestier Hill Shopping Centre", "Bugis Cube", "Capitol Piazza", "City Gate Mall", 
                   "HarbourFront Centre", "Holland Road Shopping Centre", "Oasis Terraces", "Pasir Ris West Plaza", 
                   "Paya Lebar Quarter", "People's Park Centre", "People's Park Complex", "Singapore Shopping Centre", 
                   "Teck Whye Shopping Centre", "The Adelphi", "The Rail Mall", "Woodlands Civic Centre", 
                   "Woodlands Mart", "Woodlands North Plaza")

Using the get_coords function, I can then retrieve the geospatial information for those malls. The results will be saved as an rds file for easy retrieval.

# Get coordinates of missing malls
missing_malls_coords <- get_coords(missing_malls)

# Save coordinates as rds
write_rds(missing_malls_coords, "data/processed/missing_malls_coords.rds")

Now, I will load it back in to check if there are any missing coordinates.

# Load missing malls coordinates
missing_malls_coords <- read_rds("data/processed/missing_malls_coords.rds")

# Check coordinates data 
missing_malls_coords[(is.na(missing_malls_coords$postal) | is.na(missing_malls_coords$latitude) | is.na(missing_malls_coords$longitude) | missing_malls_coords$postal=="NIL"), ]
[1] address   postal    latitude  longitude
<0 rows> (or 0-length row.names)

Now that the coordinates are fine, I can proceed to convert them into an sf dataframe with st_as_sf() and apply the appropriate projection with st_transform(). Once done, a slight adjustment will be made to the column names so that the data can be easily appended to the shopping mall sf dataframe with rbind().

The full shopping mall sf dataframe will then be saved as a rds file.

# Convert missing_malls_coords to sf
missing_malls_sf <- missing_malls_coords %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs=3414) 

# Process column names to prepare for rbind
colnames(missing_malls_sf)[colnames(missing_malls_sf) == "address"] <- "Name"
colnames(missing_malls_sf)[colnames(missing_malls_sf) == "postal"] <- "Description"

# Merge both datasets
shopping_mall_full_sf <- rbind(shoppingmall_sf, missing_malls_sf) 

# Save shopping_mall_full
write_rds(shopping_mall_full_sf, "data/processed/shopping_mall_sf.rds")
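As a side note, the two colnames() assignments above can be expressed more idiomatically with rename() from dplyr (new name on the left, old name on the right). A small self-contained illustration, with a toy data frame rather than the actual mall data:

```r
library(dplyr)

# Toy illustration: rename() is equivalent to the colnames() assignments
df <- data.frame(address = "Bugis Cube", postal = "188867")
df <- df %>% rename(Name = address, Description = postal)
names(df)  # "Name" "Description"
```

In the workflow above, the equivalent one-liner would be `missing_malls_sf <- missing_malls_sf %>% rename(Name = address, Description = postal)`.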

This can then be loaded in for one final check.

# Load shopping mall sf
shoppingmall_sf <- read_rds("data/processed/shopping_mall_sf.rds")

The tmap package will be used to visualize the updated shopping mall points.

tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

tm_shape(mpsz19_shp) +
  tm_polygons() +
tm_shape(shoppingmall_sf) +
  tm_dots(col='red', size = 0.1) +
  tm_layout(main.title = "Shopping Malls (OpenStreetMap + OpenMap API)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.

2.2.8 Supermarket

The supermarket dataset from data.gov.sg will be loaded in with st_read() and projected with st_transform() from the sf package.

# Load Supermarket Data.gov.sg dataset
supermarket_sf <- st_read("data/geospatial/SupermarketsKML.kml") %>%
  st_transform(crs = 3414)
Reading layer `SUPERMARKETS' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\SupermarketsKML.kml' 
  using driver `KML'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

As with other datasets, I will visualize it with tmap to ensure that it is fine to use.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize supermarkets
tm_shape(mpsz19_shp) +
  tm_polygons() +
tm_shape(supermarket_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = "Supermarkets (Data Gov)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Although the data point at HarbourFront Centre seems suspiciously located, it is deemed acceptable since HarbourFront Centre has two supermarkets: “Cold Storage” and “Don Don Donki”. Therefore, I will also be using this dataset for subsequent analysis.

2.2.9 Kindergarten

From the themes, I can see that the query parameter for Kindergarten is “kindergartens”. After retrieving the data with a GET request, I will save it as an rds file for easy retrieval.

base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "kindergartens"

# Update query params and header
headers <- httr::add_headers(
  Authorization = token
)

url <- paste0(base_url,parameter)

# Make the GET request
response <- httr::GET(url, headers)

# Extract the information
content <- content(response, "text")
kindergarten_df <- as.data.frame(fromJSON(content))

# Save as rds
write_rds(kindergarten_df, "data/processed/kindergarten_df.rds")

Now, I can load the data and check it.

# Load kindergarten_df
kindergarten_df <- read_rds("data/processed/kindergarten_df.rds")

# Check output
datatable(head(kindergarten_df), rownames=FALSE, width="100%",
            options=list(scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))

After looking at the kindergarten dataframe, it is clear that I will need to do some additional processing, similar to the eldercare dataset:

  1. Separate the LatLng column (a single column) into lat and lon
  2. Select only the columns I need, and drop invalid rows with a filter condition
  3. Convert it into an sf dataframe with st_as_sf() from the sf package
  4. Apply the appropriate projection with st_transform()

Once done, I will save this as an rds file.

# Rename colnames
colnames(kindergarten_df) <- gsub("^SrchResults\\.", "", colnames(kindergarten_df))

# Process sf dataframe
kindergarten_sf <- kindergarten_df %>%
  separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
  select(NAME, Type, ADDRESSPOSTALCODE, ADDRESSSTREETNAME, lat, lon) %>%
  filter(!is.na(NAME)) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(crs=3414)

# Save sf dataframe
write_rds(kindergarten_sf, "data/processed/kindergarten_sf.rds")

Since there is also a kindergarten dataset at data.gov.sg, I will load that in along with the previously saved kindergarten sf dataframe.

# Load data.gov.sg kindergarten data
kindergarten_dg_sf <- st_read("data/geospatial/Kindergartens.kml") %>%
  st_transform(crs = 3414)
Reading layer `KINDERGARTENS' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\Kindergartens.kml' 
  using driver `KML'
Simple feature collection with 448 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6887 ymin: 1.247759 xmax: 103.9717 ymax: 1.455452
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
# Load Saved OneMap API data
kindergarten_sf <- read_rds("data/processed/kindergarten_sf.rds")

By visualising both datasets side by side with tmap_arrange() from the tmap package, I can better see if there are any discrepancies, to decide which dataset to use and whether any additional processing is required.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
# Calculate row counts
num_kindergarten_sf <- nrow(kindergarten_sf)
num_kindergarten_dg_sf <- nrow(kindergarten_dg_sf)

tmap_options(check.and.fix = TRUE)

# Visualize kindergarten from OneMap API 
map_onemap <- tm_shape(mpsz19_shp) +
  tm_polygons() +
  tm_shape(kindergarten_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = paste("Kindergartens (OneMap API) - Count:", num_kindergarten_sf), 
            main.title.position = ("center"))


# Visualize kindergarten from Data.gov.sg
map_data_gov <- tm_shape(mpsz19_shp) +
  tm_polygons() +
  tm_shape(kindergarten_dg_sf) +
  tm_dots(col='red', size = 0.2) +
  tm_layout(main.title = paste("Kindergartens (Data Gov) - Count:", num_kindergarten_dg_sf), 
            main.title.position = ("center"))


tmap_arrange(map_onemap, map_data_gov, ncol = 2, nrow = 1)
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Since the counts are the same and there do not seem to be any noticeable discrepancies between the two, I can proceed with either of them in the subsequent steps.

2.2.10 Childcare Centres

As with kindergartens, the childcare centre data can be retrieved from the OneMap themes API, this time with the query parameter “childcare”. The response will be saved as an rds file for easy retrieval.

base_url <- "https://www.onemap.gov.sg/api/public/themesvc/retrieveTheme?queryName="
parameter <- "childcare"

# Update query params and header
headers <- httr::add_headers(
  Authorization = token
)

url <- paste0(base_url,parameter)

# Make the GET request
response <- httr::GET(url, headers)

# Extract the information
content <- content(response, "text")
childcare_df <- as.data.frame(fromJSON(content))

# Save childcare_df
write_rds(childcare_df, "data/processed/childcare_df.rds")

Now I can load the data and check it.

# Load childcare_df
childcare_df <- read_rds("data/processed/childcare_df.rds")

# Check output
datatable(head(childcare_df), rownames=FALSE, width="100%",
            options=list(scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))

After looking at the childcare dataframe, I will need to do some additional processing, similar to the eldercare dataset:

  1. Separate the LatLng column (a single column) into lat and lon
  2. Select only the columns I need, and drop invalid rows with a filter condition
  3. Convert it into an sf dataframe with st_as_sf() from the sf package
  4. Apply the appropriate projection with st_transform()

Once done, I will save this as an rds file.

# Rename colnames
colnames(childcare_df) <- gsub("^SrchResults\\.", "", colnames(childcare_df))


# Process sf dataframe
childcare_sf <- childcare_df %>%
  separate(LatLng, into = c("lat", "lon"), sep = ",") %>%
  select(NAME, Type, ADDRESSPOSTALCODE, ADDRESSSTREETNAME, lat, lon) %>%
  filter(!is.na(NAME)) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  st_transform(crs=3414)

# Save sf dataframe
write_rds(childcare_sf, "data/processed/childcare_sf.rds")

Now I will load it back in.

# Load Saved OneMap API data
childcare_sf <- read_rds("data/processed/childcare_sf.rds")

As with other datasets, I will visualize it with tmap to ensure that it is fine to use.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize childcare
tm_shape(mpsz19_shp) +
  tm_polygons() +
tm_shape(childcare_sf) +
  tm_dots(col='red', size = 0.1) +
  tm_layout(main.title = "Childcare (OneMap API)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there does not seem to be any anomaly with the points, so I can proceed to use this dataset in the subsequent steps.

2.2.11 Bus Stop

The Bus Stop dataset will be loaded in from the Bus Stop Location dataset (Jul 2024) in LTA Datamall with st_read() and projected with st_transform() from the sf package.

bus_sf <- st_read(dsn = "data/geospatial/BusStopLocation_Jul2024", layer = "BusStop") %>%
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\BusStopLocation_Jul2024' 
  using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21

As with other datasets, I will visualize it with tmap to ensure that it is fine to use.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize bus stops
tm_shape(mpsz19_shp) +
  tm_polygons() + 
tm_shape(bus_sf) +
  tm_dots(col='red', size = 0.1) +
  tm_layout(main.title = "Bus Stops (LTA DataMall)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Looking at the map, there are some points outside the Singapore boundary. These are likely bus stops in Johor Bahru and should be removed, as only bus stops within Singapore should be considered.

# Filter bus stops that are within mpsz19
bus_updated_sf <- bus_sf[
  apply(st_within(bus_sf, mpsz19_shp, sparse = FALSE), 1, any), 
]
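As an aside, recent versions of sf provide st_filter(), which expresses the same spatial subset more concisely than the apply(st_within()) idiom. A self-contained toy illustration (the point and polygon data here are made up for demonstration):

```r
library(sf)

# Toy illustration of st_filter(): keep the points that fall within a polygon.
pts  <- st_as_sf(data.frame(id = 1:3,
                            x = c(0.5, 0.5, 5),
                            y = c(0.5, 1.5, 5)),
                 coords = c("x", "y"))
poly <- st_sf(geometry = st_sfc(st_polygon(list(
  rbind(c(0, 0), c(1, 0), c(1, 2), c(0, 2), c(0, 0))))))

inside <- st_filter(pts, poly, .predicate = st_within)
nrow(inside)  # 2 - only the two points inside the polygon remain
```

In the workflow above, the equivalent one-liner would be `bus_updated_sf <- st_filter(bus_sf, mpsz19_shp, .predicate = st_within)`.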

A final check with tmap will be used to confirm that there are no further anomalies.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize bus stops
tm_shape(mpsz19_shp) +
  tm_polygons() + 
tm_shape(bus_updated_sf) +
  tm_dots(col='red', size = 0.1) +
  tm_layout(main.title = "Bus Stops Updated (LTA DataMall)", 
            main.title.position = ("center"))
Warning: The shape mpsz19_shp is invalid. See sf::st_is_valid

Now I can save it for use in subsequent steps.

# Save updated bus stop sf dataframe as rds
write_rds(bus_updated_sf, "data/processed/bus_updated_sf.rds")

2.3 Structural Data

While area of the unit has already been provided in the initial dataset, some data wrangling is required to generate the other structural factors.

2.3.1 Floor Level

The floor level information is currently represented in the storey_range attribute. Using unique(), I can check the different values.

# Check storey_range
unique(resale_sf$storey_range)
 [1] "01 TO 03" "04 TO 06" "07 TO 09" "25 TO 27" "10 TO 12" "13 TO 15"
 [7] "16 TO 18" "22 TO 24" "19 TO 21" "34 TO 36" "28 TO 30" "37 TO 39"
[13] "31 TO 33" "40 TO 42" "43 TO 45" "46 TO 48" "49 TO 51"

Although I won't be able to get the exact floor level, I can represent it as an ordinal attribute with mutate() and a lengthy case_when() statement. Once done, summary() will be used to check for anomalies.

# Create floor_ord 
resale_sf <- resale_sf %>%
  mutate(floor_ord = case_when(storey_range == "01 TO 03" ~ 1,
                               storey_range == "04 TO 06" ~ 2,
                               storey_range == "07 TO 09" ~ 3,
                               storey_range == "10 TO 12" ~ 4,
                               storey_range == "13 TO 15" ~ 5,
                               storey_range == "16 TO 18" ~ 6,
                               storey_range == "19 TO 21" ~ 7,
                               storey_range == "22 TO 24" ~ 8,
                               storey_range == "25 TO 27" ~ 9,
                               storey_range == "28 TO 30" ~ 10,
                               storey_range == "31 TO 33" ~ 11,
                               storey_range == "34 TO 36" ~ 12,
                               storey_range == "37 TO 39" ~ 13,
                               storey_range == "40 TO 42" ~ 14,
                               storey_range == "43 TO 45" ~ 15,
                               storey_range == "46 TO 48" ~ 16,
                               storey_range == "49 TO 51" ~ 17,
                               TRUE ~ 99))

# Check output
summary(resale_sf$floor_ord)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   3.259   4.000  17.000 
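
As an aside, since every storey band spans exactly three floors, the same ordinal value could also be derived arithmetically instead of with a lengthy case_when(). A minimal sketch (my own illustration, assuming the “XX TO YY” format holds throughout):

```r
# Derive the ordinal band from the lower bound of storey_range,
# assuming every band spans exactly three storeys ("01 TO 03" -> 1, "04 TO 06" -> 2, ...)
storey_to_ord <- function(storey_range) {
  lower <- as.integer(substr(storey_range, 1, 2))  # "04 TO 06" -> 4
  (lower - 1) %/% 3 + 1                            # 4 -> band 2
}

storey_to_ord(c("01 TO 03", "04 TO 06", "49 TO 51"))  # 1 2 17
```

The explicit case_when() is kept above since it makes the mapping auditable at a glance, but the arithmetic version avoids a sentinel value like 99 for unexpected inputs (it would instead surface them as unusual band numbers).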

2.3.2 Remaining Lease

Since I already have “remaining_lease_yr” and “remaining_lease_mth” from the previous step, I just need to combine them into a single remaining lease value expressed in months, by converting “remaining_lease_yr” into months and adding “remaining_lease_mth”. Since either column may contain NA, I will use coalesce() from dplyr package to replace missing values with 0. summary() will be used for a final check.

# Create remaining_lease_val
resale_sf <- resale_sf %>%
  mutate(remaining_lease_val = coalesce(remaining_lease_yr, 0) * 12 + coalesce(remaining_lease_mth, 0))

# Check output
summary(resale_sf$remaining_lease_val)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  497.0   728.0   881.0   885.2  1084.0  1154.0 
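
As a quick sanity check on the conversion with hypothetical values: a lease of 61 years and 4 months should come out to 61 × 12 + 4 = 736 months, and a missing month count should be treated as 0:

```r
library(dplyr)

# Hypothetical lease values; NA months fall back to 0 via coalesce()
yr  <- c(61, 70, 95)
mth <- c(4, NA, 11)
coalesce(yr, 0) * 12 + coalesce(mth, 0)  # 736 840 1151
```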

2.3.3 Age of Unit

Since this is not given, I will calculate it backwards from the remaining lease (in months), assuming that all HDB flats in this resale dataset started with 99-year leases at the time of purchase.

# Create age of unit
resale_sf <- resale_sf %>%
  mutate(age_unit = 99 * 12 - remaining_lease_val)

# Check output
summary(resale_sf$age_unit)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   34.0   104.0   307.0   302.8   460.0   691.0 

2.3.4 Main Upgrading Program (MUP)

To get this information, I have to manually crawl the HDB website for Main Upgrading Program (MUP) and Home Improvement Program (HIP). The HIP is the successor of the MUP, introduced in 2007. This is done for the unique_addresses created in 2.1. The completed data is then saved as hdb_mup_hip.csv, which will be loaded to be combined with the resale dataset.

# Load mup_hip dataset
mup_hip <- read_csv("data/aspatial/hdb_mup_hip.csv")
Rows: 8960 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): town, street_name, block, key, mup_announce_date, mup_bill_date, h...
dbl  (1): mup_status

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

With reference to Manuel Lehner (2011), I will explore creating the following binary attributes:

  • MUP_completed - if the HDB has been through the main upgrading program
  • HIP_announced - if the HDB has been announced for the home improvement program.
  • HIP_completed - if the HDB has been through the home improvement program

To facilitate the join, some slight data wrangling is required. dmy() from lubridate package is used to convert the date columns from character to date. Once done, I will join it to the resale data and create the necessary attributes, assuming each resale date to fall on the first day of its “month” value via ym() from lubridate package. Since the resale dates are from 2023 onwards, I can simply treat “mup_status” as “mup_completed”. HDBs will be coded positive for “hip_completed” if the resale date is on or after the hip_bill_date, and positive for “hip_announced” if the resale date is before the hip_bill_date but on or after the hip_announce_date.

# Process 
mup_hip_df <- mup_hip %>%
  mutate(address = paste(block,street_name),
         hip_announce_date = dmy(hip_announce_date),
         hip_bill_date = dmy(hip_bill_date)) %>%
  select(address, mup_status, hip_announce_date, hip_bill_date) 

# Left join mup_hip_df to resale_sf and create status
resale_merge_sf <- left_join(resale_sf, mup_hip_df, by = "address") %>%
  mutate(sale_date = ym(month),
         hip_completed = case_when(is.na(hip_bill_date) ~ 0,
                                   sale_date >= hip_bill_date ~ 1,
                                   TRUE ~ 0),
         hip_announced = case_when(is.na(hip_announce_date) | hip_completed == 1 ~ 0,
                                   sale_date >= hip_announce_date ~ 1,
                                   TRUE ~ 0)
  )
  

# Check output
datatable(head(resale_merge_sf), rownames=FALSE, width="100%",
            options=list(scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))

2.4 Proximity Calculation

Note

Since my seniors have graciously provided me with a function to calculate proximity (which they obtained from Megan), I will gratefully leverage it in the subsequent steps.

This proximity function calculates a distance matrix with st_distance() from sf package, then takes the distance to the nearest feature with rowMins() from matrixStats package.

# Proximity function
fn_proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    units::drop_units()
  df1[, varname] <- rowMins(dist_matrix)
  return(df1)
}
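
To illustrate what the function returns, here is a toy check on two hypothetical query points and two hypothetical facilities on the SVY21 plane (EPSG:3414, metres); the data are made up purely for illustration:

```r
library(sf)
library(matrixStats)

# Two query points at x = 0 and x = 10; two facilities at x = 3 and x = 100
pts  <- st_as_sf(data.frame(x = c(0, 10), y = c(0, 0)),
                 coords = c("x", "y"), crs = 3414)
facs <- st_as_sf(data.frame(x = c(3, 100), y = c(0, 0)),
                 coords = c("x", "y"), crs = 3414)

# fn_proximity (defined above) attaches the distance to the nearest facility
fn_proximity(pts, facs, "PROX_FAC")$PROX_FAC  # 3 7
```

Each query point gets the minimum of its row in the distance matrix: 3 m for the first point and 7 m for the second.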

I will load all the relevant sf dataframes for the various locational factors.

# Load cbd
cbd_sf <- read_rds("data/processed/cbd.rds")

# Load eldercare
eldercare_sf <- read_rds("data/processed/eldercare_sf.rds")

# Load foodcourt/hawker centre
hawker_sf <- st_read("data/geospatial/HawkerCentresKML.kml") %>%
  st_transform(crs = 3414)
Reading layer `HAWKERCENTRE' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\HawkerCentresKML.kml' 
  using driver `KML'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
# Load mrt
train_sf <- st_read(dsn = "data/geospatial/TrainStationExit", layer = "Train_Station_Exit_Layer") %>%
  st_transform(crs = 3414)
Reading layer `Train_Station_Exit_Layer' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\TrainStationExit' 
  using driver `ESRI Shapefile'
Simple feature collection with 593 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
# Load park
parks_sf <- read_rds("data/processed/parks_updated_sf.rds")

# Load good primary schools
gd_primary_sch_sf <- read_rds("data/processed/gd_primary_sch_sf.rds")

# Load shopping mall
shoppingmall_sf <- read_rds("data/processed/shopping_mall_sf.rds")

# Load supermarket
supermarket_sf <- st_read("data/geospatial/SupermarketsKML.kml") %>%
  st_transform(crs = 3414)
Reading layer `SUPERMARKETS' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\SupermarketsKML.kml' 
  using driver `KML'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

The proximity function is then run for all the various datasets.

# Compute proximity for all at once
resale_merge_sf <- fn_proximity(resale_merge_sf, cbd_sf, "PROX_CBD") %>%
  fn_proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
  fn_proximity(., hawker_sf, "PROX_HAWKER") %>%
  fn_proximity(., train_sf, "PROX_MRT") %>%
  fn_proximity(., parks_sf, "PROX_PARK") %>%
  fn_proximity(., gd_primary_sch_sf, "PROX_GDPRISCH") %>%
  fn_proximity(., shoppingmall_sf, "PROX_MALL") %>%
  fn_proximity(., supermarket_sf, "PROX_SUPERMARKET")

2.5 Facility Count

Note

Similar to proximity, my seniors have graciously provided me with a function to count the number of facilities within a given radius (also obtained from Megan).

# Count within radius
fn_num_radius <- function(df1, df2, varname, radius) {
  dist_matrix <- st_distance(df1, df2) %>%
    units::drop_units() %>%
    as.data.frame()
  df1[,varname] <- rowSums(dist_matrix <= radius)
  return(df1)
}
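
Analogous to the proximity check, a toy run on hypothetical points (EPSG:3414, metres) shows what the count function produces; the coordinates are made up purely for illustration:

```r
library(sf)

# Two query points; three facilities at 1 m, 4 m and 100 m from the origin
pts  <- st_as_sf(data.frame(x = c(0, 10), y = c(0, 0)),
                 coords = c("x", "y"), crs = 3414)
facs <- st_as_sf(data.frame(x = c(1, 4, 100), y = c(0, 0, 0)),
                 coords = c("x", "y"), crs = 3414)

# fn_num_radius (defined above) counts facilities within the radius
fn_num_radius(pts, facs, "N_FAC", 5)$N_FAC  # 2 0
```

The first point has two facilities within 5 m (at 1 m and 4 m); the second has none, since its nearest facility lies 6 m away.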

I will load all the remaining sf dataframes for the various locational factors.

# Load kindergarten
kindergarten_sf <- read_rds("data/processed/kindergarten_sf.rds")

# Load childcare
childcare_sf <- read_rds("data/processed/childcare_sf.rds")

# Load bus stop
bus_sf <- read_rds("data/processed/bus_updated_sf.rds")

# Load primary school
primary_sch_sf <- read_rds("data/processed/primary_sch_sf.rds")
# Calculate the count within different radius
resale_merge_sf <- fn_num_radius(resale_merge_sf, kindergarten_sf, "WITHIN_350M_KINDERGARTEN", 350) %>%
  fn_num_radius(., childcare_sf, "WITHIN_350M_CHILDCARE", 350) %>%
  fn_num_radius(., bus_sf, "WITHIN_350M_BUS", 350) %>%
  fn_num_radius(., primary_sch_sf, "WITHIN_1KM_PRISCH", 1000)

Now I will save this as an rds file for easy retrieval.

# Save output as rds 
write_rds(resale_merge_sf, "data/processed/resale_full.sf")

Finally, some preprocessing is needed for the subsequent analysis:

  • Sales between January 2024 and June 2024 will be filtered out
  • Only 5-room flats will be kept
  • Only the required columns will be selected

Once done, I will save this as an rds file for subsequent retrieval.

# Load full resale data
resale_full_sf <- read_rds("data/processed/resale_full.sf")

# Filter resale data and select required columns 
resale_flt_sf <- resale_full_sf %>%
  filter(!(sale_date >= as.Date("2024-01-01") & sale_date <= as.Date("2024-06-30"))) %>%
  filter(flat_type == "5 ROOM") %>%
  select(sale_date, town, flat_type, address, resale_price, floor_area_sqm, floor_ord, remaining_lease_val, age_unit, mup_status, 
         hip_announced, hip_completed, PROX_CBD, PROX_ELDERCARE, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_GDPRISCH, 
         PROX_MALL, PROX_SUPERMARKET, WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, WITHIN_350M_BUS, WITHIN_1KM_PRISCH)

# Save output to rds
write_rds(resale_flt_sf, "data/processed/resale_flt_sf.rds")

3 Exploratory Data Analysis

Before proceeding to calibrate a predictive model, it is always good to do some exploratory data analysis to better understand the data. I’ll clear the R environment to free memory and load in the processed data.

# Clear R environment
rm(list = ls(all.names = TRUE))

# Load resale data
resale_sf <- read_rds("data/processed/resale_flt_sf.rds")

# Load mpsz data
mpsz19_shp <- st_read(dsn = "data/geospatial/MPSZ-2019/", layer = "MPSZ-2019") %>% 
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\Users\Henry\Desktop\SMU Masters\2024-2025 T1\Geospatial Analytics & Applications\Project\GeospatialWebsite\Take-Home_Ex\Take-Home_Ex_03\data\geospatial\MPSZ-2019' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

3.1 Resale Price

First, I’ll look at “resale_price”, since it is the all-important dependent variable for the predictive model. ggplot2 package will be used to plot a histogram of its distribution.

# Plot distribution of resale_price
ggplot(data=resale_sf, aes(x=`resale_price`)) +
  geom_histogram(bins=20, color="black", fill="light blue") + 
  ggtitle("Distribution of Resale Prices") + 
  theme(plot.title = element_text(hjust = 0.5))

Since the distribution is skewed to the right, I can apply a log transformation to make it more symmetric. The result will be visualized again with ggplot2 to check the distribution.

# Create new log variable
resale_sf <- resale_sf %>%
  mutate(`log_resale_price` = log(resale_price))

# Visualize results
ggplot(data=resale_sf, aes(x=`log_resale_price`)) +
  geom_histogram(bins=20, color="black", fill="light blue") + 
  ggtitle("Distribution of Log Resale Prices") + 
  theme(plot.title = element_text(hjust = 0.5))

The resale price can also be visualized with an interactive map.

# Turn on interactive mode
tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE) 

# Create interactive point symbol map
tm_shape(mpsz19_shp)+
  tm_polygons() +
tm_shape(resale_sf) +  
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,12))
Warning: The shape mpsz19_shp is invalid (after reprojection). See
sf::st_is_valid
Note

Looking at the map, it seems that HDB flats tend to be much more expensive around the downtown area. Interestingly, units towards the North-East and East are noticeably more expensive than those towards the North or West.

3.2 Locational Factors

Now, I’ll proceed to look at the distribution of the various locational factors.

# Create 12 histograms
PROX_CBD <- ggplot(data=resale_sf, aes(x= `PROX_CBD`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_ELDERLYCARE <- ggplot(data=resale_sf, aes(x= `PROX_ELDERCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_HAWKER <- ggplot(data=resale_sf, aes(x= `PROX_HAWKER`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MRT <- ggplot(data=resale_sf, aes(x= `PROX_MRT`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PARK <- ggplot(data=resale_sf, aes(x= `PROX_PARK`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_GDPRISCH <- ggplot(data=resale_sf, aes(x= `PROX_GDPRISCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MALL <- ggplot(data=resale_sf, aes(x= `PROX_MALL`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_SUPERMARKET <- ggplot(data=resale_sf, aes(x= `PROX_SUPERMARKET`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")


WITHIN_350M_KINDERGARTEN <- ggplot(data=resale_sf, aes(x= `WITHIN_350M_KINDERGARTEN`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

WITHIN_350M_CHILDCARE <- ggplot(data=resale_sf, aes(x= `WITHIN_350M_CHILDCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

WITHIN_350M_BUS <- ggplot(data=resale_sf, aes(x= `WITHIN_350M_BUS`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

WITHIN_1KM_PRISCH <- ggplot(data=resale_sf, aes(x= `WITHIN_1KM_PRISCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")


# Arrange the 12 histograms into 3 columns by 4 rows
ggarrange(PROX_CBD, PROX_ELDERLYCARE, PROX_HAWKER, PROX_MRT, 
          PROX_PARK, PROX_GDPRISCH, PROX_MALL, PROX_SUPERMARKET,
          WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, WITHIN_350M_BUS, WITHIN_1KM_PRISCH,
          ncol = 3, nrow = 4)

Judging from the distributions, all proximity locational factors are somewhat right-skewed with the exception of “PROX_CBD”, which is slightly left-skewed. The count factors (“WITHIN…”) all appear roughly normally distributed, except “WITHIN_350M_KINDERGARTEN”, which is more right-skewed. However, since I don’t know whether all the attributes come from a normal distribution, I will instead use normalize() from heatmaply package to rescale the values onto a common range. Following that, I’ll plot the histograms again to verify this.
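
For clarity, normalize() from heatmaply performs min-max scaling, mapping each attribute linearly onto the [0, 1] interval rather than z-score standardising it. A minimal illustration on made-up values:

```r
library(heatmaply)

# Min-max scaling: (x - min(x)) / (max(x) - min(x))
normalize(c(2, 4, 6, 10))  # 0.00 0.25 0.50 1.00
```

Note that min-max scaling changes only the range, not the shape of a distribution, so the histograms below should retain their original skew.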

# Normalize variables for locational factors
resale_sf <- resale_sf %>%
  mutate(std_PROX_CBD = normalize(PROX_CBD),
         std_PROX_ELDERCARE = normalize(PROX_ELDERCARE),
         std_PROX_HAWKER = normalize(PROX_HAWKER),
         std_PROX_MRT = normalize(PROX_MRT),
         std_PROX_PARK = normalize(PROX_PARK),
         std_PROX_GDPRISCH = normalize(PROX_GDPRISCH),
         std_PROX_MALL = normalize(PROX_MALL),
         std_PROX_SUPERMARKET = normalize(PROX_SUPERMARKET),
         std_WITHIN_350M_KINDERGARTEN = normalize(WITHIN_350M_KINDERGARTEN),
         std_WITHIN_350M_CHILDCARE = normalize(WITHIN_350M_CHILDCARE),
         std_WITHIN_350M_BUS = normalize(WITHIN_350M_BUS), 
         std_WITHIN_1KM_PRISCH = normalize(WITHIN_1KM_PRISCH))

# Create 12 histograms
PROX_CBD <- ggplot(data=resale_sf, aes(x= `std_PROX_CBD`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_ELDERLYCARE <- ggplot(data=resale_sf, aes(x= `std_PROX_ELDERCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_HAWKER <- ggplot(data=resale_sf, aes(x= `std_PROX_HAWKER`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MRT <- ggplot(data=resale_sf, aes(x= `std_PROX_MRT`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PARK <- ggplot(data=resale_sf, aes(x= `std_PROX_PARK`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_GDPRISCH <- ggplot(data=resale_sf, aes(x= `std_PROX_GDPRISCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MALL <- ggplot(data=resale_sf, aes(x= `std_PROX_MALL`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_SUPERMARKET <- ggplot(data=resale_sf, aes(x= `std_PROX_SUPERMARKET`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

WITHIN_350M_KINDERGARTEN <- ggplot(data=resale_sf, aes(x= `std_WITHIN_350M_KINDERGARTEN`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

WITHIN_350M_CHILDCARE <- ggplot(data=resale_sf, aes(x= `std_WITHIN_350M_CHILDCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

WITHIN_350M_BUS <- ggplot(data=resale_sf, aes(x= `std_WITHIN_350M_BUS`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

WITHIN_1KM_PRISCH <- ggplot(data=resale_sf, aes(x= `std_WITHIN_1KM_PRISCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

# Arrange the 12 histograms into 3 columns by 4 rows
ggarrange(PROX_CBD, PROX_ELDERLYCARE, PROX_HAWKER, PROX_MRT, 
          PROX_PARK, PROX_GDPRISCH, PROX_MALL, PROX_SUPERMARKET,
          WITHIN_350M_KINDERGARTEN, WITHIN_350M_CHILDCARE, WITHIN_350M_BUS, WITHIN_1KM_PRISCH,
          ncol = 3, nrow = 4)

3.3 Structural Factors

Now I’ll look at the structural factors.

# Create 4 histograms
UNIT_AREA <- ggplot(data=resale_sf, aes(x= `floor_area_sqm`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

UNIT_FLOOR <- ggplot(data=resale_sf, aes(x= `floor_ord`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

UNIT_LEASE <- ggplot(data=resale_sf, aes(x= `remaining_lease_val`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

UNIT_AGE <- ggplot(data=resale_sf, aes(x= `age_unit`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

# Arrange the 4 histograms into 2 columns by 2 rows
ggarrange(UNIT_AREA, UNIT_FLOOR, UNIT_LEASE, UNIT_AGE, 
          ncol = 2, nrow = 2)

From the charts, I can tell that the unit area follows a somewhat normal distribution. The floor level is understandably right-skewed, as proportionately few HDB blocks are high-rise. Remaining lease and unit age seem to have slightly bimodal distributions. Regardless, I’ll apply the same scaling to them with normalize() from heatmaply package.

# Normalize variables for structural factors
resale_sf <- resale_sf %>%
  mutate(std_UNIT_AREA = normalize(floor_area_sqm),
         std_UNIT_FLOOR = normalize(floor_ord),
         std_UNIT_LEASE = normalize(remaining_lease_val),
         std_UNIT_AGE = normalize(age_unit))

# Create 4 histograms
UNIT_AREA <- ggplot(data=resale_sf, aes(x= `std_UNIT_AREA`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

UNIT_FLOOR <- ggplot(data=resale_sf, aes(x= `std_UNIT_FLOOR`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

UNIT_LEASE <- ggplot(data=resale_sf, aes(x= `std_UNIT_LEASE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

UNIT_AGE <- ggplot(data=resale_sf, aes(x= `std_UNIT_AGE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

# Arrange the 4 histograms into 2 columns by 2 rows
ggarrange(UNIT_AREA, UNIT_FLOOR, UNIT_LEASE, UNIT_AGE, 
          ncol = 2, nrow = 2)

To analyse the MUP/HIP status, I’ll combine the three binary attributes into a single categorical variable (0 = none, 1 = MUP completed, 2 = HIP announced, 3 = HIP completed). prop.table() will then be used to display the proportions.

# Combine MUP/HIP
resale_sf <- resale_sf %>%
  mutate(mup_hip_status = case_when(mup_status == 1 ~ 1,
                                    hip_announced == 1 ~ 2,
                                    hip_completed == 1 ~ 3,
                                    TRUE ~ 0))

# Show proportions
prop.table(table(resale_sf$mup_hip_status))

         0          1          2          3 
0.73857571 0.03476110 0.07863559 0.14802760 
Note

From this, I can see that the majority of HDB flats (74%) have not undergone MUP/HIP. This could prove interesting if the programs do have a significant impact on resale price.

3.4 Compare Train and Test Datasets

A common understanding is that resale prices are strongly tied to location. Therefore, it is important to check that the location proportions of the training and test datasets are relatively similar. If there are major differences, model performance could suffer, as a model calibrated on the training dataset may be tuned towards locations that are under-represented in the test dataset.

To do this, some data wrangling is required to get the proportion of towns in both datasets. This will then be visualized with ggplot2 package.

# Create train data town statistics
train_town <- resale_sf %>%
  filter(year(sale_date) == 2023) %>%
  group_by(town) %>%
  summarise(count = n(),
            proportion = round(n()/nrow(.)*100, 2)) %>%
  mutate(source = "Train")

# Create test data town statistics
test_town <- resale_sf %>%
  filter(year(sale_date) != 2023) %>%
  group_by(town) %>%
  summarise(count = n(),
            proportion = round(n()/nrow(.)*100, 2)) %>%
  mutate(source = "Test")

# Combine both statistics
town_summary <- bind_rows(train_town, test_town)

# Visualize statistics
ggplot(town_summary, aes(x = reorder(town, -proportion), y = proportion, fill = source)) +
  geom_bar(stat = "identity", position = "dodge") +  
  labs(
    title = "Proportion of Observations by Town (Comparison)",
    x = "Town",
    y = "Proportion (%)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +  # Rotate labels for readability
  scale_fill_manual(values = c("skyblue", "salmon")) 

Note

Although the test data has a noticeably greater proportion of Sengkang observations and a smaller proportion of Sembawang observations, the proportions for the train and test datasets are otherwise relatively similar. Therefore, I shouldn’t worry too much about imbalance issues.

3.5 Multicollinearity Analysis

Finally, I will check for multicollinearity with a correlation matrix. To do so, I will remove the geometry column with st_drop_geometry(). ggcorrmat() from ggstatsplot package will then be used to plot the matrix.

# Drop geometry
resale_nogeo <- resale_sf %>%
  st_drop_geometry()

# List of variables
corr_list <- c("std_PROX_CBD", "std_PROX_ELDERCARE", "std_PROX_HAWKER", "std_PROX_MRT", 
               "std_PROX_PARK", "std_PROX_GDPRISCH", "std_PROX_MALL", "std_PROX_SUPERMARKET",
               "std_WITHIN_350M_KINDERGARTEN", "std_WITHIN_350M_CHILDCARE", "std_WITHIN_350M_BUS",
               "std_WITHIN_1KM_PRISCH", "std_UNIT_AREA", "std_UNIT_FLOOR", "std_UNIT_LEASE", "std_UNIT_AGE",
               "mup_status", "hip_announced", "hip_completed")
  

# Build correlation matrix
ggcorrmat(resale_nogeo, corr_list)

Note

Unsurprisingly, age of unit is perfectly negatively correlated (-1) with remaining lease, since age is computed directly from remaining lease. Therefore, I will not use age of unit when building the model in the subsequent steps. The other pairs do not show strong correlations (above 0.5), so it should be fine to proceed to build the model with those variables.

4 Multiple Linear Regression

Before proceeding, I will use st_jitter() to ensure that the points do not overlap. An amount of 2 metres is used here, as it guarantees some jittering happens (given coordinate rounding) while keeping the points within the building footprint.

# Jitter points
resale_sf <- resale_sf %>%
  st_jitter(amount = 2)
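
To see the effect of st_jitter(), the sketch below (my own illustration, not part of the pipeline; the seed and coordinates are made up) jitters two coincident points, with each coordinate perturbed uniformly within ±amount:

```r
library(sf)

set.seed(1234)  # hypothetical seed for reproducibility
pts <- st_as_sf(data.frame(x = c(0, 0), y = c(0, 0)),
                coords = c("x", "y"), crs = 3414)

# Each coordinate is shifted by a uniform draw in [-2, 2] metres,
# so previously identical points no longer overlap exactly
jit <- st_jitter(pts, amount = 2)
st_coordinates(jit)
```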

Next, I will split the data into training (2023 transactions) and test (remaining transactions) sets.

# Create resale train sf
resale_train_sf <- resale_sf %>%
  filter(year(sale_date) == 2023)

# Create resale test sf
resale_test_sf <- resale_sf %>%
  filter(year(sale_date) != 2023)

4.1 Calibrate Non-Spatial Model

Using lm(), I will build a multiple linear regression model. The results can then be reviewed with ols_regress() from olsrr package.

# Build model
resale_mlr <- lm(formula = resale_price ~ std_PROX_CBD + std_PROX_ELDERCARE +
                   std_PROX_HAWKER + std_PROX_MRT + std_PROX_PARK + std_PROX_GDPRISCH +
                   std_PROX_MALL + std_PROX_SUPERMARKET + std_WITHIN_350M_KINDERGARTEN +
                   std_WITHIN_350M_CHILDCARE + std_WITHIN_350M_BUS + std_WITHIN_1KM_PRISCH +
                   std_UNIT_AREA + std_UNIT_FLOOR + std_UNIT_LEASE +
                   mup_status + hip_announced + hip_completed,
                 data = resale_train_sf)


# Check model with olsrr
olsrr::ols_regress(resale_mlr)
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.854       RMSE                    73575.682 
R-Squared                   0.729       MSE                5431041346.346 
Adj. R-Squared              0.728       Coef. Var                  10.754 
Pred R-Squared              0.727       AIC                    147575.848 
MAE                     54730.438       SBC                    147709.308 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                    ANOVA                                      
------------------------------------------------------------------------------
                    Sum of                                                    
                   Squares          DF       Mean Square       F         Sig. 
------------------------------------------------------------------------------
Regression    8.508485e+13          18      4.726936e+12    870.355    0.0000 
Residual      3.163038e+13        5824    5431041346.346                      
Total         1.167152e+14        5842                                        
------------------------------------------------------------------------------

                                                   Parameter Estimates                                                    
-------------------------------------------------------------------------------------------------------------------------
                       model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
-------------------------------------------------------------------------------------------------------------------------
                 (Intercept)     670044.043     10829.136                  61.874    0.000     648814.916     691273.171 
                std_PROX_CBD    -388010.122      6073.367       -0.584    -63.887    0.000    -399916.177    -376104.066 
          std_PROX_ELDERCARE      -9179.524      5695.395       -0.013     -1.612    0.107     -20344.613       1985.564 
             std_PROX_HAWKER     -82675.208      6032.947       -0.115    -13.704    0.000     -94502.025     -70848.391 
                std_PROX_MRT    -116766.699      6150.883       -0.146    -18.984    0.000    -128824.715    -104708.683 
               std_PROX_PARK      -8764.112      5472.541       -0.012     -1.601    0.109     -19492.324       1964.100 
           std_PROX_GDPRISCH     -33475.750      5807.999       -0.046     -5.764    0.000     -44861.586     -22089.914 
               std_PROX_MALL     -67790.809      7440.154       -0.072     -9.111    0.000     -82376.275     -53205.343 
        std_PROX_SUPERMARKET      -7093.849      9635.483       -0.005     -0.736    0.462     -25982.974      11795.276 
std_WITHIN_350M_KINDERGARTEN      78153.274      8408.860        0.074      9.294    0.000      61668.785      94637.762 
   std_WITHIN_350M_CHILDCARE     -80973.076     11141.566       -0.062     -7.268    0.000    -102814.684     -59131.468 
         std_WITHIN_350M_BUS      21051.056      6639.013        0.024      3.171    0.002       8036.124      34065.987 
       std_WITHIN_1KM_PRISCH    -123552.833      7345.628       -0.146    -16.820    0.000    -137952.992    -109152.673 
               std_UNIT_AREA     394750.866     11093.281        0.333     35.585    0.000     373003.915     416497.817 
              std_UNIT_FLOOR     288672.700      7304.638        0.290     39.519    0.000     274352.897     302992.502 
              std_UNIT_LEASE     351743.956      7517.271        0.610     46.791    0.000     337007.312     366480.599 
                  mup_status      26347.973      7022.651        0.034      3.752    0.000      12580.969      40114.976 
               hip_announced      40668.871      4306.686        0.077      9.443    0.000      32226.167      49111.574 
               hip_completed      25108.497      4367.483        0.061      5.749    0.000      16546.609      33670.386 
-------------------------------------------------------------------------------------------------------------------------

Looking at the results, I will remove the following and re-run the model:

  • “std_PROX_ELDERCARE” - not statistically significant (0.107 > alpha of 0.05)
  • “std_PROX_PARK” - not statistically significant (0.109 > alpha of 0.05)
  • “std_PROX_SUPERMARKET” - not statistically significant (0.462 > alpha of 0.05)
# Build model
resale_mlr <- lm(formula = resale_price ~ std_PROX_CBD +
                   std_PROX_HAWKER + std_PROX_MRT + std_PROX_GDPRISCH +
                   std_PROX_MALL + std_WITHIN_350M_KINDERGARTEN +
                   std_WITHIN_350M_CHILDCARE + std_WITHIN_350M_BUS + std_WITHIN_1KM_PRISCH +
                   std_UNIT_AREA + std_UNIT_FLOOR + std_UNIT_LEASE +
                   mup_status + hip_announced + hip_completed,
                 data = resale_train_sf)


# Check model with olsrr
olsrr::ols_regress(resale_mlr)
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.854       RMSE                    73604.278 
R-Squared                   0.729       MSE                5432465553.102 
Adj. R-Squared              0.728       Coef. Var                  10.755 
Pred R-Squared              0.727       AIC                    147574.389 
MAE                     54693.745       SBC                    147687.830 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                     ANOVA                                      
-------------------------------------------------------------------------------
                    Sum of                                                     
                   Squares          DF       Mean Square       F          Sig. 
-------------------------------------------------------------------------------
Regression    8.506026e+13          15      5.670684e+12    1043.851    0.0000 
Residual      3.165498e+13        5827    5432465553.102                       
Total         1.167152e+14        5842                                         
-------------------------------------------------------------------------------

                                                   Parameter Estimates                                                    
-------------------------------------------------------------------------------------------------------------------------
                       model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
-------------------------------------------------------------------------------------------------------------------------
                 (Intercept)     665164.793     10251.378                  64.885    0.000     645068.286     685261.300 
                std_PROX_CBD    -393902.159      5386.962       -0.593    -73.121    0.000    -404462.605    -383341.713 
             std_PROX_HAWKER     -81213.265      5910.770       -0.113    -13.740    0.000     -92800.569     -69625.961 
                std_PROX_MRT    -118382.491      6104.164       -0.148    -19.394    0.000    -130348.918    -106416.064 
           std_PROX_GDPRISCH     -34561.985      5729.876       -0.047     -6.032    0.000     -45794.668     -23329.302 
               std_PROX_MALL     -68653.047      7285.288       -0.073     -9.424    0.000     -82934.917     -54371.178 
std_WITHIN_350M_KINDERGARTEN      78336.932      8364.873        0.075      9.365    0.000      61938.676      94735.189 
   std_WITHIN_350M_CHILDCARE     -79763.133     11083.765       -0.061     -7.196    0.000    -101491.426     -58034.840 
         std_WITHIN_350M_BUS      21697.593      6612.864        0.025      3.281    0.001       8733.926      34661.260 
       std_WITHIN_1KM_PRISCH    -123511.340      7279.866       -0.146    -16.966    0.000    -137782.580    -109240.100 
               std_UNIT_AREA     395573.296     10846.313        0.333     36.471    0.000     374310.496     416836.096 
              std_UNIT_FLOOR     288845.789      7282.994        0.290     39.660    0.000     274568.417     303123.161 
              std_UNIT_LEASE     353643.092      7237.791        0.614     48.861    0.000     339454.336     367831.848 
                  mup_status      28575.384      6894.719        0.037      4.145    0.000      15059.175      42091.592 
               hip_announced      41612.513      4277.459        0.079      9.728    0.000      33227.105      49997.921 
               hip_completed      26575.289      4277.114        0.065      6.213    0.000      18190.559      34960.020 
-------------------------------------------------------------------------------------------------------------------------

Since all the independent variables are statistically significant, I will proceed to the subsequent steps after saving this model as a rds object.

# Save model as rds
write_rds(resale_mlr, "data/model/resale_mlr.rds")

4.2 Multicollinearity Check

Using check_collinearity() from the performance package, I can compute the VIF scores for the resale multiple linear regression model. The scores can then be shown in a table with datatable() from the DT package.

# Check multicollinearity
vif <- performance::check_collinearity(resale_mlr)

# Round values to 5 decimal places
vif[] <- lapply(vif, function(x) if(is.numeric(x)) round(x, 5) else x)

# Format output in a table
datatable(vif, rownames=FALSE, width="100%",
            options=list(pageLength = 15,
                         scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))

Using theme() from ggplot2, I can also visualize the VIF scores in a graph.

# Plot vif
plot(vif) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Note

Since the VIF values of the independent variables are all less than 10, I can conclude that there is no sign of multicollinearity.
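As a quick intuition check, VIF can also be computed by hand: regress each predictor on the remaining predictors and take 1 / (1 − R²). A minimal base-R sketch on toy data (the variable names below are made up, not the resale variables):

```r
# Sketch: VIF computed by hand on toy data (variable names are made up)
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
toy$x2 <- 0.6 * toy$x1 + rnorm(100)   # x2 is correlated with x1

# VIF of x1 = 1 / (1 - R^2), where R^2 comes from regressing x1 on the others
r2_x1  <- summary(lm(x1 ~ x2 + x3, data = toy))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1   # well below 10, so x1 would not be flagged
```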

4.3 Additional Tests

Besides multicollinearity, I also need to run a few other diagnostic tests.

4.3.1 Non-Linearity Test

Using ols_plot_resid_fit() from olsrr package, I will test the linearity assumption.

# Check non-linearity
ols_plot_resid_fit(resale_mlr)

Note

Since most of the data points are scattered around the 0 line, I can conclude that the dependent and independent variables have a linear relationship.

4.3.2 Normality Test

Using ols_plot_resid_hist() from olsrr package, I will test the normality assumption.

# Check normality
ols_plot_resid_hist(resale_mlr)

Note

The plot shows that the residuals of the multiple linear regression model resemble a normal distribution. However, further tests are required to confirm this.

Given that the Shapiro-Wilk test has a sample size limit of 5000, I will have to sample a portion of the residuals from “resale_mlr”. Since sampling is involved, I will need to set a seed to ensure reproducibility.

# Set seed
set.seed(42)

# Check normality
ols_test_normality(sample(residuals(resale_mlr), size = 5000))
-----------------------------------------------
       Test             Statistic       pvalue  
-----------------------------------------------
Shapiro-Wilk              0.9701         0.0000 
Kolmogorov-Smirnov        0.0519         0.0000 
Cramer-von Mises         417.3867        0.0000 
Anderson-Darling         24.2327         0.0000 
-----------------------------------------------
Note

Since the p-values of the tests are much smaller than the alpha value of 0.05, I can reject the null hypothesis and conclude that there is statistical evidence that the residuals are not normally distributed.
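The same kind of conclusion can be cross-checked with base R's shapiro.test(), which imposes the same 5000-observation cap. A sketch on a deliberately skewed stand-in vector (not the actual model residuals):

```r
# Sketch: base-R normality check on a stand-in vector (not the model residuals)
set.seed(42)
fake_resid <- sample(rexp(6000) - 1, size = 5000)   # deliberately skewed data
shapiro.test(fake_resid)$p.value                    # tiny p-value: reject normality
```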

4.3.3 Spatial Autocorrelation Test

Before performing the spatial autocorrelation test on the residuals of the multiple linear regression model, it is good to visualize the residuals on a map. To do so, I need to extract the residuals from the model and merge them with the resale_train_sf dataframe.

# Extract residuals of mlr
mlr_res <- as.data.frame(resale_mlr$residuals)

# Join residuals with resale_sf
resale_res_sf <- cbind(resale_train_sf, mlr_res) %>%
  rename(`MLR_RES` = `resale_mlr.residuals`)


# Set tmap to plot
tmap_mode("plot")

tmap_options(check.and.fix = TRUE)

# Visualize residuals
tm_shape(mpsz19_shp)+
  tm_polygons(alpha = 0.4) +
tm_shape(resale_res_sf) +  
  tm_dots(col = "MLR_RES", alpha = 0.6, style = "quantile", size = 0.2) +
  tm_layout(main.title = "Multiple Linear Regression Residuals",     
            main.title.position = "center",                       
            main.title.size = 1.5                                  
            )

Note

Judging from the map, there appears to be some clustering, which is a sign of spatial autocorrelation.

To conclusively prove spatial autocorrelation, I will use the Moran’s I test. As the sfdep package lacks a way to do this for a linear model, I will use the spdep package instead. To do so, I need to:

  • Convert the sf dataframe to a SpatialPointsDataFrame with as_Spatial() from sf package.
  • Create the distance-based neighbour list with dnearneigh() from spdep package.
  • Convert the neighbour list into spatial weights with nb2listw() from spdep package.
# Convert simple features object to spatialpointsdataframe
resale_sp <- as_Spatial(resale_res_sf)

# Create neighbour list
nb <- dnearneigh(coordinates(resale_sp), 0, 1500, longlat = FALSE)

# Create spatial weights matrix
nb_lw <- nb2listw(nb, style = 'W')

Finally, I can perform the Moran’s I test with lm.morantest() from spdep package. The output will be saved as a rds object.

# Perform Moran's I test
mlr_moran <- lm.morantest(resale_mlr, nb_lw)

# Save output to rds
write_rds(mlr_moran, "data/model/mlr_moran.rds")

The results can then be loaded for us to check.

# Load Moran's I test results
read_rds("data/model/mlr_moran.rds")

    Global Moran I for regression residuals

data:  
model: lm(formula = resale_price ~ std_PROX_CBD + std_PROX_HAWKER + std_PROX_MRT + std_PROX_GDPRISCH + std_PROX_MALL + std_WITHIN_350M_KINDERGARTEN +
std_WITHIN_350M_CHILDCARE + std_WITHIN_350M_BUS + std_WITHIN_1KM_PRISCH + std_UNIT_AREA + std_UNIT_FLOOR + std_UNIT_LEASE + mup_status + hip_announced +
hip_completed, data = resale_train_sf)
weights: nb_lw

Moran I statistic standard deviate = 229.13, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    3.134660e-01    -1.024189e-03     1.883935e-06 
Note

The Global Moran’s I test shows that its p-value is less than the alpha value of 0.05. Therefore, I will reject the null hypothesis that the residuals are randomly distributed.

Since the Observed Global Moran I = 0.3135 > 0, I can infer that the residuals resemble a clustered distribution.
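The reported standard deviate can be reproduced from the printed estimates, since it is simply (observed − expectation) / √variance:

```r
# Reproduce the Moran's I standard deviate from the printed estimates
I_obs <- 3.134660e-01
I_exp <- -1.024189e-03
I_var <- 1.883935e-06
z <- (I_obs - I_exp) / sqrt(I_var)
round(z, 2)   # 229.13, matching the lm.morantest() output
```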

5 Geographically Random Forest

Before running any models, I will need to prepare the coordinates data as it is required by the SpatialML package. This can be done with st_coordinates() from sf package. After which, I will remove the geometry data from the resale_train_sf with st_drop_geometry() from sf package to create an aspatial dataset.

# Get coordinates from full, training and test data
coords <- st_coordinates(resale_sf)
coords_train <- st_coordinates(resale_train_sf)
coords_test <- st_coordinates(resale_test_sf)

# Save coordinates for easy retrieval
write_rds(coords, "data/model/coords.rds")
write_rds(coords_train, "data/model/coords_train.rds")
write_rds(coords_test, "data/model/coords_test.rds")

# Drop geometry
resale_train_nogeo <- resale_train_sf %>% 
  st_drop_geometry()

5.1 Calibrate Random Forest Model

To better highlight the differences between a geographical weighted model and a non-spatial model, I will first calibrate a random forest model using the ranger package. The seed is also set to ensure reproducibility. Parameters for this rf model will follow the grf model closely to better highlight the influence of geographical factors.

Once done, this will be saved as a rds object to facilitate future retrieval.

# Set seed
set.seed(42)

# Calibrate random forest model
rf <- ranger(resale_price ~ std_PROX_CBD + # std_PROX_ELDERCARE +
               std_PROX_HAWKER + std_PROX_MRT + std_PROX_GDPRISCH + # std_PROX_PARK + 
               std_PROX_MALL + std_WITHIN_350M_KINDERGARTEN + # + std_PROX_SUPERMARKET
               std_WITHIN_350M_CHILDCARE + std_WITHIN_350M_BUS + std_WITHIN_1KM_PRISCH +
               std_UNIT_AREA + std_UNIT_FLOOR + std_UNIT_LEASE +
               mup_status + hip_announced + hip_completed,
             data = resale_train_nogeo,
             num.trees = 50, # set to a small number to manage complexity 
             mtry = 2, # default - p/3 ~ 5
             verbose = TRUE,
             importance = "impurity" 
             )

# Save ranger output
write_rds(rf, "data/model/rf.rds")

Now, I can load the rf model to check the output.

# Load ranger output
rf <- read_rds("data/model/rf.rds")

# Check output
rf
Ranger result

Call:
 ranger(resale_price ~ std_PROX_CBD + std_PROX_HAWKER + std_PROX_MRT +      std_PROX_GDPRISCH + std_PROX_MALL + std_WITHIN_350M_KINDERGARTEN +      std_WITHIN_350M_CHILDCARE + std_WITHIN_350M_BUS + std_WITHIN_1KM_PRISCH +      std_UNIT_AREA + std_UNIT_FLOOR + std_UNIT_LEASE + mup_status +      hip_announced + hip_completed, data = resale_train_nogeo,      num.trees = 50, mtry = 2, verbose = TRUE, importance = "impurity") 

Type:                             Regression 
Number of trees:                  50 
Sample size:                      5843 
Number of independent variables:  15 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       2201977628 
R squared (OOB):                  0.8897834 

5.2 Calibrate Geographical Random Forest Model

For the Geographical Random Forest model, I have decided to use an adaptive bandwidth as it is better able to capture local patterns and handle differences in data distribution across Singapore. To determine the optimal adaptive bandwidth, bw.gwr() from GWmodel package will be used, with the approach set to cross-validation (“CV”). The SpatialPointsDataFrame created earlier in 4.3.3 can also be reused. Once completed, the bandwidth will be saved as a rds object.

# Compute adaptive bandwidth 
bw_adaptive <- bw.gwr(formula = resale_price ~ std_PROX_CBD + # std_PROX_ELDERCARE +
                        std_PROX_HAWKER + std_PROX_MRT + std_PROX_GDPRISCH + # std_PROX_PARK + 
                        std_PROX_MALL + std_WITHIN_350M_KINDERGARTEN + # + std_PROX_SUPERMARKET
                        std_WITHIN_350M_CHILDCARE + std_WITHIN_350M_BUS + std_WITHIN_1KM_PRISCH +
                        std_UNIT_AREA + std_UNIT_FLOOR + std_UNIT_LEASE +
                        mup_status + hip_announced + hip_completed,
                      data=resale_sp, 
                      approach="CV", 
                      kernel="gaussian", 
                      adaptive=TRUE, 
                      longlat=FALSE)

# Save adaptive bandwidth
write_rds(bw_adaptive, "data/model/bw_adaptive.rds")

Now, I can load the adaptive bandwidth to check.

# Load adaptive bandwidth
bw_adaptive <- read_rds("data/model/bw_adaptive.rds")

# Check adaptive bandwidth
bw_adaptive
[1] 72
Note

The optimal adaptive bandwidth from cross-validation is determined to be 72, i.e. each local model will be calibrated on its 72 nearest observations.

Finally, I can calibrate the geographical random forest model with grf() from SpatialML package. To reduce complexity while still having a robust model, I have set “ntree” to 50 and “mtry” to 2. Once done, this will be saved as a rds object.

# Set seed
set.seed(42)

# Calibrate geographic random forest model
gwRF_adaptive <- grf(formula = resale_price ~ std_PROX_CBD + 
                        std_PROX_HAWKER + std_PROX_MRT + std_PROX_GDPRISCH + 
                        std_PROX_MALL + std_WITHIN_350M_KINDERGARTEN + 
                        std_WITHIN_350M_CHILDCARE + std_WITHIN_350M_BUS + std_WITHIN_1KM_PRISCH +
                        std_UNIT_AREA + std_UNIT_FLOOR + std_UNIT_LEASE +
                        mup_status + hip_announced + hip_completed,
                     dframe=resale_train_nogeo, 
                     bw = bw_adaptive,
                     ntree = 50, # default - 500
                     mtry = 2, # default - p/3 ~ 5
                     kernel="adaptive",
                     nthreads = 12,
                     verbose = TRUE,
                     coords=coords_train)

# Save model output
write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")

6 Model Evaluation

Since this is a predictive model, I am more interested in regression error metrics (e.g. root mean square error) than in goodness of fit (e.g. R-squared), which better suits explanatory models. First, I’ll load back all the models calibrated in the previous steps.

# Load mlr model
model_mlr <- read_rds("data/model/resale_mlr.rds")

# Load random forest model
model_rf <- read_rds("data/model/rf.rds")

# Load geographically random forest model
model_grf <- read_rds("data/model/gwRF_adaptive.rds")

6.1 Predicting with Test Data

To evaluate the different models, I would need to know their performance on the test data. Therefore, I will first need to generate predictions on the test data for each of those models.

6.1.1 Multiple Linear Regression Model

To predict with the multiple linear regression ordinary least squares model, I can use predict().

# Predict with mlr
resale_test_sf$predict_mlr <- predict(object = model_mlr, newdata = resale_test_sf)

6.1.2 Random Forest Model

For the random forest model, I will similarly use predict(), but I first have to remove the geometry data with st_drop_geometry() from sf package. The predictions can be found in the “predictions” element of the output object.

# Remove geometry data from resale_test_sf
resale_test_nogeo <- resale_test_sf %>%
  st_drop_geometry()

# Predict with rf
rf_pred <- predict(model_rf, data = resale_test_nogeo)

# Extract and append predictions to resale_test_sf
resale_test_sf$predict_rf <- rf_pred$predictions

6.1.3 Geographical Random Forest Model

For the geographical random forest model, I will need an additional step: combining the coordinates data with the geometry-less test data. The predictions will be computed with predict.grf() from SpatialML package and saved as a rds object.

# Combine test data with coordinates
gwRF_test_data <- cbind(resale_test_nogeo, coords_test) 

# Compute predicted values
gwRF_pred <- predict.grf(model_grf, 
                         gwRF_test_data, 
                         x.var.name="X",
                         y.var.name="Y", 
                         local.w=1,
                         global.w=0,
                         nthreads = 12)

# Save output
write_rds(gwRF_pred, "data/model/GRF_pred.rds")

Now I can load it back in and append the predictions to resale_test_sf. The resale_test_sf object, which now holds the predictions from all the models, will be saved as a rds object.

# Load output
grf_pred <- read_rds("data/model/GRF_pred.rds")

# Combine predicted values with test data
resale_test_sf$predict_grf <- grf_pred

# Save output
write_rds(resale_test_sf, "data/model/resale_test_sf.rds")

6.2 Metrics

To compute the metrics, I will load in the resale_test_sf with all the predicted values and compute 2 regression metrics for each model:

  • RMSE: Root Mean Square Error
  • MAE: Mean Absolute Error

This will be visualized with datatable() from DT package.

# Load in resale_test_sf
resale_test_sf <- read_rds("data/model/resale_test_sf.rds")

# Compute rmse for the different models
rmse_mlr <- Metrics::rmse(resale_test_sf$resale_price, resale_test_sf$predict_mlr)
rmse_rf <- Metrics::rmse(resale_test_sf$resale_price, resale_test_sf$predict_rf)
rmse_grf <- Metrics::rmse(resale_test_sf$resale_price, resale_test_sf$predict_grf)

# Compute mae for the different models
mae_mlr <- Metrics::mae(resale_test_sf$resale_price, resale_test_sf$predict_mlr)
mae_rf <- Metrics::mae(resale_test_sf$resale_price, resale_test_sf$predict_rf)
mae_grf <- Metrics::mae(resale_test_sf$resale_price, resale_test_sf$predict_grf)

# Combine metrics into a dataframe
metrics_df <- data.frame(
  Multiple_Linear_Regression = c(rmse_mlr, mae_mlr),
  Random_Forest = c(rmse_rf, mae_rf),
  Geographically_Weighted_RF = c(rmse_grf, mae_grf)
)

# Assign rownames
rownames(metrics_df) <- c("RMSE", "MAE")

# Visualize results
datatable(metrics_df, rownames=TRUE, width="100%",
            options=list(scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))
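For reference, the two metrics are straightforward to compute by hand; the Metrics functions implement exactly these formulas. A sketch with made-up toy prices:

```r
# Sketch: RMSE and MAE by hand (toy prices, not the actual resale data)
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))

actual    <- c(500000, 420000, 610000)
predicted <- c(480000, 450000, 600000)
rmse_manual(actual, predicted)   # ~21602.47
mae_manual(actual, predicted)    # 20000
```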

However, I also want to know how the models fare across different locations. To do so, I will first use st_join() from sf package to get the “PLN_AREA_N” information from “mpsz19_shp” dataset. RMSE and MAE are then computed for all models within each “PLN_AREA_N”, with the best model being the one with the smallest value for each metric.

This is then joined back to “mpsz19_shp” dataset which is summarized by “PLN_AREA_N” with st_union() from sf package after a group_by().

# Join Planning Area from MPSZ data to resale_test_sf
resale_with_area <- st_join(resale_test_sf, mpsz19_shp["PLN_AREA_N"])

# Calculate RMSE and MAE by PLN_AREA_N for each model
rmse_by_area <- resale_with_area %>%
  group_by(PLN_AREA_N) %>%
  summarise(
    RMSE_MLR = Metrics::rmse(resale_price, predict_mlr),
    RMSE_RF = Metrics::rmse(resale_price, predict_rf),
    RMSE_GRF = Metrics::rmse(resale_price, predict_grf),
    MAE_MLR = Metrics::mae(resale_price, predict_mlr),
    MAE_RF = Metrics::mae(resale_price, predict_rf),
    MAE_GRF = Metrics::mae(resale_price, predict_grf)
  ) %>%
  mutate(Best_RMSE = case_when(RMSE_MLR == pmin(RMSE_MLR, RMSE_RF, RMSE_GRF) ~ "MLR",
                               RMSE_RF == pmin(RMSE_MLR, RMSE_RF, RMSE_GRF) ~ "RF",
                               RMSE_GRF == pmin(RMSE_MLR, RMSE_RF, RMSE_GRF) ~ "GRF"
                                 ),
         Best_MAE = case_when(MAE_MLR == pmin(MAE_MLR, MAE_RF, MAE_GRF) ~ "MLR",
                              MAE_RF == pmin(MAE_MLR, MAE_RF, MAE_GRF) ~ "RF",
                              MAE_GRF == pmin(MAE_MLR, MAE_RF, MAE_GRF) ~ "GRF"
                                 ),
  ) %>%
  st_drop_geometry()

# Join back to mpsz data
mpsz19_shp_with_metrics <- mpsz19_shp %>%
  group_by(PLN_AREA_N) %>%
  summarise(geometry = st_union(geometry)) %>% # Union geometries by PLN_AREA_N
  ungroup() %>%
  left_join(rmse_by_area, by = "PLN_AREA_N")

With the updated mpsz19_shp data, I can plot the 2 metrics performance side by side for each planning area.

# Set tmap mode to plot
tmap_mode("plot")
tmap mode set to plotting
tmap_options(check.and.fix = TRUE)

# Visualize best rmse model by planning area
plot_rmse <- tm_shape(mpsz19_shp_with_metrics) +
  tm_polygons("Best_RMSE", 
              title = "Model", 
              palette = "Set2", 
              border.col = "gray",id = "PLN_AREA_N") +
  tm_layout(main.title = "Best Model by RMSE for Each Area",
            main.title.position = "center") +
  tm_view(set.zoom.limits = c(11, 12))

# Visualize best mae model by planning area
plot_mae <- tm_shape(mpsz19_shp_with_metrics) +
  tm_polygons("Best_MAE", 
              title = "Model", 
              palette = "Set2", 
              border.col = "gray",id = "PLN_AREA_N") +
  tm_layout(main.title = "Best Model by MAE for Each Area",
            main.title.position = "center") +
  tm_view(set.zoom.limits = c(11, 12))

# Plot the 2 maps side by side
tmap_arrange(plot_rmse, plot_mae, ncol = 2, sync = TRUE)
Warning: The shape mpsz19_shp_with_metrics is invalid. See sf::st_is_valid
Warning: The shape mpsz19_shp_with_metrics is invalid. See sf::st_is_valid

Note

These maps show which model attains the lowest RMSE and MAE within each planning area, highlighting where the geographically weighted model outperforms the aspatial ones.

6.3 Visualizing Performance

6.3.1 Scatterplot

Using a scatterplot with ggplot2 package, I can check the performance of each model visually by plotting the actual resale prices against the predicted values. These are then arranged in a suitable format with ggarrange() from ggpubr package.

# Plot for Multiple Linear Regression
plot_mlr <- ggplot(data = resale_test_sf,
                   aes(x = predict_mlr,
                       y = resale_price)) +
  geom_point() + 
  ggtitle("Resale Prices vs Predicted Prices (MLR)") + 
  theme(plot.title = element_text(hjust = 0.5))

# Plot for Random Forest
plot_rf <- ggplot(data = resale_test_sf,
                   aes(x = predict_rf,
                       y = resale_price)) +
  geom_point() + 
  ggtitle("Resale Prices vs Predicted Prices (RF)") + 
  theme(plot.title = element_text(hjust = 0.5))

# Plot for Geographical Random Forest
plot_grf <- ggplot(data = resale_test_sf,
                   aes(x = predict_grf,
                       y = resale_price)) +
  geom_point() + 
  ggtitle("Resale Prices vs Predicted Prices (GRF)") + 
  theme(plot.title = element_text(hjust = 0.5))

# Visualize plots side by side
ggarrange(plot_mlr, plot_rf, plot_grf, ncol = 2, nrow = 2)

Note

The more tightly the points cluster around the diagonal, the closer the model’s predictions are to the actual resale prices.

6.3.2 Prediction Errors Across Singapore

I will first calculate the residuals for each model’s predictions, then visualize them side by side with boxplot().

# Create residuals for each model prediction
resale_test_sf <- resale_test_sf %>%
  mutate(residual_mlr = resale_price - predict_mlr,
         residual_rf = resale_price - predict_rf,
         residual_grf = resale_price - predict_grf)

# Visualize boxplot
boxplot(resale_test_sf$residual_mlr, resale_test_sf$residual_rf, resale_test_sf$residual_grf,
        names = c("OLS MLR", "Random Forest", "Geographically Weighted RF"),
        main = "Residual Comparison Across Models",
        ylab = "Residuals",
        col = c("skyblue", "salmon", "lightgreen"))

Note

A narrower box indicates a tighter spread of residuals and hence more consistent predictions.

To better understand each model, I will also visualize its residuals across Singapore. For cross-comparison, the model’s RMSE performance by region will also be shown.

# Set tmap mode to view
tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)

# Visualize residuals for multiple linear regression
tm_shape(mpsz19_shp_with_metrics) +
  tm_polygons("RMSE_MLR", 
              palette = "YlOrRd",         
              title = "RMSE by Region") +  
  tm_shape(resale_test_sf) +
  tm_dots("residual_mlr",
          palette = "-RdBu",
          title = "Residual (Over/Underestimation)",
          size = 0.1,
          id = "address",
          breaks = c(-Inf, seq(-400000, 600000, by = 200000), Inf)) +
  tm_layout(main.title = "Residuals (MLR)",
            main.title.position = ("center")) +
  tm_view(set.zoom.limits = c(11, 12))
Warning: The shape mpsz19_shp_with_metrics is invalid (after reprojection). See
sf::st_is_valid
Variable(s) "residual_mlr" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Note

Analysis of the residual patterns will be incorporated into the conclusion.

# Set tmap mode to view
tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)

# Visualize residuals for random forest
tm_shape(mpsz19_shp_with_metrics) +
  tm_polygons("RMSE_RF", 
              palette = "YlOrRd",         
              title = "RMSE by Region") +  
  tm_shape(resale_test_sf) +
  tm_dots("residual_rf",
          palette = "-RdBu",
          title = "Residual (Over/Underestimation)",
          size = 0.1,
          id = "address",
          breaks = c(-Inf, seq(-400000, 600000, by = 200000), Inf)) +
  tm_layout(main.title = "Residuals (RF)",
            main.title.position = ("center")) +
  tm_view(set.zoom.limits = c(11, 12))
Warning: The shape mpsz19_shp_with_metrics is invalid (after reprojection). See
sf::st_is_valid
Variable(s) "residual_rf" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Note

Analysis of the residual patterns will be incorporated into the conclusion.

# Set tmap mode to view
tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)

# Visualize residuals for geographical random forest
tm_shape(mpsz19_shp_with_metrics) +
  tm_polygons("RMSE_GRF", 
              palette = "YlOrRd",         
              title = "RMSE by Region") +  
  tm_shape(resale_test_sf) +
  tm_dots("residual_grf",
          palette = "-RdBu",
          title = "Residual (Over/Underestimation)",
          size = 0.1,
          id = "address",
          breaks = c(-Inf, seq(-400000, 600000, by = 200000), Inf)) +
  tm_layout(main.title = "Residuals (GRF)",
            main.title.position = ("center")) +
  tm_view(set.zoom.limits = c(11, 12))
Warning: The shape mpsz19_shp_with_metrics is invalid (after reprojection). See
sf::st_is_valid
Variable(s) "residual_grf" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Note

Analysis of the residual patterns will be incorporated into the conclusion.

6.4 Variable Importance

While the focus is on prediction, it is always good to understand the variable importance. It helps explain the model, which is invaluable when presenting it to stakeholders: model metrics matter, but stakeholders typically demand an explanation before they will trust a model.

Since the geographical random forest model is the model of choice, I will use the variable importance from that model, found within the “Global.Model” element of the model object. Using enframe() from tibble package, I will convert the named vector into a tibble. Minor processing is done (arrange values, add thousand separators) before visualizing it with datatable() from DT package.

# Extract importance from geographical random forest model output
importance_tbl <- enframe(model_grf$Global.Model$variable.importance, name = "Factor", value = "Value")

# Arrange by importance and add thousand separators
importance_tbl <- importance_tbl %>%
  arrange(desc(Value)) %>%
  mutate(Value = scales::comma(Value))

datatable(importance_tbl, rownames=TRUE, width="100%",
            options=list(scrollX = TRUE, 
                         initComplete = JS(
                           "function(settings, json) {",
                           "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff', 'font-family': 'sans-serif', 'font-size': '12px'});",
                           "$(this.api().table().body()).css({'font-family': 'sans-serif', 'font-size': '12px'});",
                           "}"
    )))
Note

Analysis for the variable importance will be incorporated into the conclusion.
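Besides the table, the importance values can also be shown as a bar chart. A minimal ggplot2 sketch, using hypothetical values for illustration (the real values come from the model’s Global.Model$variable.importance):

```r
# Sketch: variable importance bar chart (the Value numbers here are made up)
library(ggplot2)
imp <- data.frame(
  Factor = c("std_UNIT_LEASE", "std_PROX_CBD", "std_UNIT_AREA"),
  Value  = c(3.0e13, 2.5e13, 1.5e13)
)
ggplot(imp, aes(x = reorder(Factor, Value), y = Value)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Importance (impurity)")
```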

7 Conclusion

Resale flat prices are a constant hot topic in Singapore, with demand for resale flats coming from growing families as well as investors.

References

Manuel Lehner (2011). Modelling housing prices in Singapore applying spatial hedonic regression. Retrieved from https://archiv.ivt.ethz.ch/docs/students/sa307.pdf